Screen Shot 2022-12-13 at 2.59.28 PM.png

Establishing a Viable Workspace

In [67]:
# mounting my google drive
from google.colab import drive
drive.mount('/content/drive')
# changing the directory to my folder designated for this project
%cd /content/drive/My Drive/TU/SEMESTERS/f2022/data_final_proj
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/TU/SEMESTERS/f2022/data_final_proj
In [68]:
# importing important packages I will need 
import matplotlib.pyplot as plt
import pandas as pd
!pip install geopandas
import geopandas as gpd
import numpy as np
import requests
import seaborn as sns 
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: geopandas in /usr/local/lib/python3.8/dist-packages (0.12.2)
Requirement already satisfied: fiona>=1.8 in /usr/local/lib/python3.8/dist-packages (from geopandas) (1.9.1)
Requirement already satisfied: packaging in /usr/local/lib/python3.8/dist-packages (from geopandas) (23.0)
Requirement already satisfied: pyproj>=2.6.1.post1 in /usr/local/lib/python3.8/dist-packages (from geopandas) (3.4.1)
Requirement already satisfied: pandas>=1.0.0 in /usr/local/lib/python3.8/dist-packages (from geopandas) (1.3.5)
Requirement already satisfied: shapely>=1.7 in /usr/local/lib/python3.8/dist-packages (from geopandas) (2.0.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (57.4.0)
Requirement already satisfied: cligj>=0.5 in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (0.7.2)
Requirement already satisfied: attrs>=19.2.0 in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (22.2.0)
Requirement already satisfied: munch>=2.3.2 in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (2.5.0)
Requirement already satisfied: click-plugins>=1.0 in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (1.1.1)
Requirement already satisfied: click~=8.0 in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (8.1.3)
Requirement already satisfied: certifi in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (2022.12.7)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas>=1.0.0->geopandas) (2022.7.1)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.8/dist-packages (from pandas>=1.0.0->geopandas) (1.21.6)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas>=1.0.0->geopandas) (2.8.2)
Requirement already satisfied: six in /usr/local/lib/python3.8/dist-packages (from munch>=2.3.2->fiona>=1.8->geopandas) (1.15.0)

proj goals.jpg

  1. Determine whether the driver or the constructor plays a larger role in Formula One success

    • I will utilize a regression discontinuity framework to identify the effect that switching teams had to a driver's average lap time, pitstop time, fastest speed, and average starting and finishing positions for a given year. If the driver switches to a 'better' team, it is expected that average lap time and pitstop time will decrease and average fastest speed will increase.
  2. Build a regression model that will predict how a driver's average finishing position will be affected if they switch teams between seasons.

    • I will utilize historical data on the current 10 Formula One teams to build a regression model that will predict how a driver's average finishing position for a will be affected if they switch teams. I will consider the following variables when building my model: lap times, pitstop times, average starting position, average finishing position, fastest lap speed and constructor reference information.

My motivations: F1 is one of the few sports that is almost completely reliant on data. Telemetry from the cars to the pits has been seen in races since the 1980s however, F1 data usage has greatly expanded since then. Dependent on the team, cars can be fitted with upwards of 300 sensors that generate over one million data points per second. Teams are limited in the number of employees that can be present on race day and thus, teams are utilizing the mass collection of data in real time and transmitting said data to analysts and data engineers at an off-site location for immediate feedback. Drivers are able to not only rely on their gut instinct and years of training, but now they can also rely on models and predictions in real time to make difficult decisions. This is why I would consider F1 analytics to be important. It is interesting to me because F1 has been rapidly expanding into the USA recently with social media promotions and deals with Netflix US and their kind of analytics would constitute my dream job and is ultimately why I decided to pursue this as my milestone topic.

Further Reading about Data in Formula One:

  1. Data Collection and Formula One Cars
  2. Amazon Web Services Partnership
  3. The Future of Formula One

DATA: DESCRIPTIONS, LOADING, SCRAPING, & CLEANING

Screen Shot 2022-12-13 at 3.12.01 PM.png

  1. circuits.csv : This dataset provides pertinent information about the different circuits within Formula One. A circuit is a track where a Formula One race occurs. Within this dataset the revelevant variables are: circuitId, circuitRef, location (city), location (country, latitude, longitude, and altitude.
  1. constructor_results.csv : Within Formula One, a constructor is the company or manufacturer that is responsible for the design, development, and construction of a team's car. This dataset summarizes how each constructor did at each race and how many points they received. Within this dataset, the revelant variables are: constructorResultId, raceId, constructorId, & points.
  1. constructor_standings.csv : This dataset highlights how each constructor did at specific races as well as the points their team received for said standings. This dataset is unique from constructors_results as it also includes information about a constructor did for the entire season.
  1. constructors.csv : This dataset gives information on the different constructors within Formula One, their specific ID, and nationality. The following variables are relevant: constructorId, constructorRef, & nationality.
  1. driver_standings.csv : This dataset illustrates the outcomes of each race. It shows how each driver did at specific races and how many points were awarded. The following variables are pertinent to our analysis: raceId, driverId, points, position, & wins (number of wins so far in the season).

  2. drivers.csv : This dataset gives detailed information on all Formula One drivers for the past 70+ years. Within this dataset, there is information on nationality as well as age. The following variables were important in our analysis: driverId, forename, surname & nationality.

  3. lap_times.csv : This dataset details each driver and how they performed each lap of every race. Each row signifies how a specific driver performed for each lap. We are given position after completing the lap as well as the time of lap. The following variables were used throughout my analysis: raceId, driverId, lap, milliseconds (time of lap in milliseconds), & position (driver position after lap).

  4. pit_stops.csv : This dataset provides information about all pitstops within the races. The pitstop dataset provides detailed information on stop length, what lap the stop occurred, and which driver was stopping. The following variables were pertinent to our analysis: raceId, driverId, stop (for a specific driver at a specific race, which stop was this), lap number (lap when the stop occurred), & milliseconds (time of the pitstop in milliseconds).

  1. qualifying.csv : To determine grid placement on race day, drivers compete in qualifying rounds on Saturday. This dataset details how each driver did at qualifiers, the overall time for each lap, and the placement earned for Sunday's race. The following variables were pertinent and used in my analysis: raceId, driverId, constructorId, position (where the driver would start based on their qualifying times), q1, q2, q3. More information about how qualifiers work in F1 can be found later in the analysis.
  1. races.csv : This dataset gives specific IDs to each race within Formula One between 1950-2022. The dataset has information on the circuit, what date the race was on, and even what time the race began. The following variables are pertinent: raceId, year, circuitId, & name (of the circuit). There was also information about day and time of the race, qualifying rounds and sprint races, however, none of this data was used in my analysis.
  1. results.csv : This dataset summarizes pertinent information regarding the results of each Formula One race. Within the dataset, we can see where each driver started, where they finished and their fastest lap time. Some of the pertinent variables are: raceId, driverId, constructorId, grid (position when starting), position (position when finished), points, laps (the number of races in said race), time in milliseconds, fastestLap, fastestLapTime, fastestLapSpeed, & status of race.
  1. seasons.csv : This dataset provides a quick Wikipedia link for pertinent information on each Formula One Season.
  1. sprint_results.csv : Sprint races are new to Formula One. They are not offered at every race but typically only at 'big' races. Only 1/3 the time of a normal race, the drivers compete in a 'mock race' to earn more points for their team and learn the track. This dataset captures important timing and placing metrics for the recent sprint races.
  1. status.csv : This is a foreign key table that provides text explanations for the different race statuses.

Screen Shot 2022-12-13 at 3.48.16 PM.png

In [69]:
# reading each of the datasets into a dataframe 
circuits = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/circuits.csv")
constructor_results = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/constructor_results.csv")
constructor_standings= pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/constructor_standings.csv")
constructors = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/constructors.csv")
driver_standings = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/driver_standings.csv")
drivers = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/drivers.csv")
lap_times = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/lap_times.csv")
pit_stops = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/pit_stops.csv")
qualifying = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/qualifying.csv")
races = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/races.csv")
results = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/results.csv")
seasons = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/seasons.csv")
sprint_results = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/sprint_results.csv")
status = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/status.csv")

Data Set Citations

  1. I found all of these datasets on Kaggle. The user who posted these datasets cleaned the data and compiled each table from http://ergast.com/mrd/

Screen Shot 2022-12-13 at 3.53.07 PM.png

Within Formula One's 72 year history, there have been 171 constructors. In order to accuratly make predictions based on the current Formula One Constructors, I scraped the wikipedia pages of the ten constructors who raced in 2022. I was able gather historical information on drivers for these 10 teams. Later in this project, I concactonated the data from these teams and I was able to see driver switches over time. The web scraping I did acted as a foreign key to all the Kaggle data frames, once it was in place, I could down select any of the Kaggle dataframes to only include information on current drivers.

In [70]:
## Red Bull Racing: Drivers by Year
url = 'https://en.wikipedia.org/wiki/Red_Bull_Racing'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

red_bull_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  red_bull_drivers.append(df_t[0])

red_bull_drivers[3] = red_bull_drivers[3].drop(columns = ['Car', 'Engine', 'Tyres', 'No.', 'Points', 'Position', 'Name'])
red_bull_drivers[3] = red_bull_drivers[3].join(red_bull_drivers[3]['Drivers'].str.split( expand =True).rename(
    columns = {0: 'First1', 1: 'Last1', 2:'First2', 3:'Last2', 4:'First3', 5:'Last3'}
)) 
red_bull_drivers[3]['Driver1'] = red_bull_drivers[3]['First1']  + ' ' + red_bull_drivers[3]['Last1'] 
red_bull_drivers[3]['Driver2'] = red_bull_drivers[3]['First2']  + ' ' + red_bull_drivers[3]['Last2'] 
red_bull_drivers[3]['Driver3'] = red_bull_drivers[3]['First3']  + ' ' + red_bull_drivers[3]['Last3'] 

red_bull_drivers[3] = red_bull_drivers[3].drop(columns = ['Drivers', 'First1', 'Last1', 'First2', 'Last2', 'First3', 'Last3'])
red_bull_drivers[3].drop([18], axis=0, inplace=True)
red_bull_drivers[3] = pd.melt(red_bull_drivers[3], id_vars=['Year'])
red_bull_drivers[3] = red_bull_drivers[3].dropna()
red_bull_drivers[3]['Constructor'] = 'Red Bull'
red_bull_drivers[3] = red_bull_drivers[3].rename(columns={"value": "Driver"})
red_bull_drivers[3] = red_bull_drivers[3].drop(columns = ['variable'])


red_bull_drivers[3]['constructorId'] = '9'


red_bull_drivers[3]
Out[70]:
Year Driver Constructor constructorId
0 2005 David Coulthard Red Bull 9
1 2006 David Coulthard Red Bull 9
2 2007 David Coulthard Red Bull 9
3 2008 David Coulthard Red Bull 9
4 2009 Mark Webber Red Bull 9
5 2010 Sebastian Vettel Red Bull 9
6 2011 Sebastian Vettel Red Bull 9
7 2012 Sebastian Vettel Red Bull 9
8 2013 Sebastian Vettel Red Bull 9
9 2014 Sebastian Vettel Red Bull 9
10 2015 Daniel Ricciardo Red Bull 9
11 2016 Daniel Ricciardo Red Bull 9
12 2017 Daniel Ricciardo Red Bull 9
13 2018 Daniel Ricciardo Red Bull 9
14 2019 Pierre Gasly Red Bull 9
15 2020 Alex Albon Red Bull 9
16 2021 Sergio Pérez Red Bull 9
17 2022 Max Verstappen Red Bull 9
18 2005 Christian Klien Red Bull 9
19 2006 Christian Klien Red Bull 9
20 2007 Mark Webber Red Bull 9
21 2008 Mark Webber Red Bull 9
22 2009 Sebastian Vettel Red Bull 9
23 2010 Mark Webber Red Bull 9
24 2011 Mark Webber Red Bull 9
25 2012 Mark Webber Red Bull 9
26 2013 Mark Webber Red Bull 9
27 2014 Daniel Ricciardo Red Bull 9
28 2015 Daniil Kvyat Red Bull 9
29 2016 Daniil Kvyat Red Bull 9
30 2017 Max Verstappen Red Bull 9
31 2018 Max Verstappen Red Bull 9
32 2019 Alex Albon Red Bull 9
33 2020 Max Verstappen Red Bull 9
34 2021 Max Verstappen Red Bull 9
35 2022 Sergio Pérez Red Bull 9
36 2005 Vitantonio Liuzzi Red Bull 9
37 2006 Robert Doornbos Red Bull 9
47 2016 Max Verstappen Red Bull 9
50 2019 Max Verstappen Red Bull 9
In [71]:
## Merecedes: Drivers by Year
url = 'https://en.wikipedia.org/wiki/Mercedes-Benz_in_Formula_One'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

mercedes_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  mercedes_drivers.append(df_t[0])

mercedes_drivers[5].drop([2], axis =0, inplace =True)
mercedes_drivers[5] = mercedes_drivers[5][['Year', 'Drivers']]

# Please note that Mercredes raced drivers in 1954 and 1955 before taking a hiatus uintil 2010. 
# For the purposes of my analysis on the current Mercedes team, I did not include 1954 - 1955 as the rules were different and there were many more drivers
mercedes_drivers[5].drop([0], axis =0, inplace =True)
mercedes_drivers[5].drop([1], axis =0, inplace =True)
mercedes_drivers[5].drop([16], axis =0, inplace =True)

mercedes_drivers[5] = mercedes_drivers[5].join(mercedes_drivers[5]['Drivers'].str.split( expand =True).rename(
    columns = {0: 'First1', 1: 'Last1', 2:'First2', 3:'Last2', 4:'First3', 5:'Last3'}))

mercedes_drivers[5]['Driver1'] = mercedes_drivers[5]['First1']  + ' ' + mercedes_drivers[5]['Last1'] 
mercedes_drivers[5]['Driver2'] = mercedes_drivers[5]['First2']  + ' ' + mercedes_drivers[5]['Last2'] 
mercedes_drivers[5]['Driver3'] = mercedes_drivers[5]['First3']  + ' ' + mercedes_drivers[5]['Last3'] 

mercedes_drivers[5] = mercedes_drivers[5][['Year', 'Driver1', 'Driver2', 'Driver3']]
mercedes_drivers[5] = pd.melt(mercedes_drivers[5], id_vars=['Year'])
mercedes_drivers[5] = mercedes_drivers[5].dropna()

mercedes_drivers[5]['Constructor'] = 'Mercedes'
mercedes_drivers[5] = mercedes_drivers[5].rename(columns={"value": "Driver"})
mercedes_drivers[5] = mercedes_drivers[5].drop(columns= ['variable'])
mercedes_drivers[5]['constructorId'] = '131'


mercedes_drivers[5]
Out[71]:
Year Driver Constructor constructorId
0 2010 Michael Schumacher Mercedes 131
1 2011 Michael Schumacher Mercedes 131
2 2012 Michael Schumacher Mercedes 131
3 2013 Nico Rosberg Mercedes 131
4 2014 Nico Rosberg Mercedes 131
5 2015 Nico Rosberg Mercedes 131
6 2016 Nico Rosberg Mercedes 131
7 2017 Lewis Hamilton Mercedes 131
8 2018 Lewis Hamilton Mercedes 131
9 2019 Lewis Hamilton Mercedes 131
10 2020 Lewis Hamilton Mercedes 131
11 2021 Lewis Hamilton Mercedes 131
12 2022 Lewis Hamilton Mercedes 131
13 2010 Nico Rosberg Mercedes 131
14 2011 Nico Rosberg Mercedes 131
15 2012 Nico Rosberg Mercedes 131
16 2013 Lewis Hamilton Mercedes 131
17 2014 Lewis Hamilton Mercedes 131
18 2015 Lewis Hamilton Mercedes 131
19 2016 Lewis Hamilton Mercedes 131
20 2017 Valtteri Bottas Mercedes 131
21 2018 Valtteri Bottas Mercedes 131
22 2019 Valtteri Bottas Mercedes 131
23 2020 George Russell Mercedes 131
24 2021 Valtteri Bottas Mercedes 131
25 2022 George Russell Mercedes 131
36 2020 Valtteri Bottas Mercedes 131
In [72]:
#Ferrari
url = 'https://en.wikipedia.org/wiki/Ferrari_Grand_Prix_results'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

ferrari_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  ferrari_drivers.append(df_t[0])

ferrari_drivers= pd.concat([ferrari_drivers[6], ferrari_drivers[7]], ignore_index=True)

ferrari_drivers = ferrari_drivers[['Year', 'Driver']]
ferrari_drivers = ferrari_drivers.dropna()
ferrari_drivers.drop([30], axis =0, inplace =True)
ferrari_drivers.drop([40], axis =0, inplace =True)


# Adding constructor name 
ferrari_drivers['Constructor'] = 'Ferrari'
ferrari_drivers['constructorId'] = '6'


ferrari_drivers
Out[72]:
Year Driver Constructor constructorId
1 2010 Fernando Alonso Ferrari 6
2 2010 Felipe Massa Ferrari 6
4 2011 Fernando Alonso Ferrari 6
5 2011 Felipe Massa Ferrari 6
7 2012 Fernando Alonso Ferrari 6
8 2012 Felipe Massa Ferrari 6
10 2013 Fernando Alonso Ferrari 6
11 2013 Felipe Massa Ferrari 6
13 2014 Fernando Alonso Ferrari 6
14 2014 Kimi Räikkönen Ferrari 6
16 2015 Kimi Räikkönen Ferrari 6
17 2015 Sebastian Vettel Ferrari 6
19 2016 Kimi Räikkönen Ferrari 6
20 2016 Sebastian Vettel Ferrari 6
22 2017 Kimi Räikkönen Ferrari 6
23 2017 Sebastian Vettel Ferrari 6
25 2018 Kimi Räikkönen Ferrari 6
26 2018 Sebastian Vettel Ferrari 6
28 2019 Charles Leclerc Ferrari 6
29 2019 Sebastian Vettel Ferrari 6
32 2020 Charles Leclerc Ferrari 6
33 2020 Sebastian Vettel Ferrari 6
35 2021 Charles Leclerc Ferrari 6
36 2021 Carlos Sainz Jr. Ferrari 6
38 2022 Charles Leclerc Ferrari 6
39 2022 Carlos Sainz Jr. Ferrari 6
In [73]:
# Mclaren
url = 'https://en.wikipedia.org/wiki/McLaren_Grand_Prix_results'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

mclaren_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  mclaren_drivers.append(df_t[0])

mclaren_drivers = pd.concat([mclaren_drivers[5], mclaren_drivers[6]], ignore_index=True)

mclaren_drivers = mclaren_drivers[['Year', 'Drivers']]
mclaren_drivers = mclaren_drivers.dropna()
mclaren_drivers.drop([33], axis =0, inplace =True)
mclaren_drivers.drop([43], axis =0, inplace =True)


# Adding constructor name 
mclaren_drivers['Constructor'] = 'McLaren'
mclaren_drivers = mclaren_drivers.rename(columns={"Drivers": "Driver"})
mclaren_drivers['constructorId'] = '1'


mclaren_drivers
Out[73]:
Year Driver Constructor constructorId
1 2010 Jenson Button McLaren 1
2 2010 Lewis Hamilton McLaren 1
4 2011 Jenson Button McLaren 1
5 2011 Lewis Hamilton McLaren 1
7 2012 Jenson Button McLaren 1
8 2012 Lewis Hamilton McLaren 1
10 2013 Jenson Button McLaren 1
11 2013 Sergio Pérez McLaren 1
13 2014 Jenson Button McLaren 1
14 2014 Kevin Magnussen McLaren 1
16 2015 Jenson Button McLaren 1
17 2015 Kevin Magnussen McLaren 1
18 2015 Fernando Alonso McLaren 1
20 2016 Fernando Alonso McLaren 1
21 2016 Stoffel Vandoorne McLaren 1
22 2016 Jenson Button McLaren 1
24 2017 Fernando Alonso McLaren 1
25 2017 Jenson Button McLaren 1
26 2017 Stoffel Vandoorne McLaren 1
28 2018 Fernando Alonso McLaren 1
29 2018 Stoffel Vandoorne McLaren 1
31 2019 Lando Norris McLaren 1
32 2019 Carlos Sainz Jr. McLaren 1
35 2020 Lando Norris McLaren 1
36 2020 Carlos Sainz Jr. McLaren 1
38 2021 Lando Norris McLaren 1
39 2021 Daniel Ricciardo McLaren 1
41 2022 Lando Norris McLaren 1
42 2022 Daniel Ricciardo McLaren 1
In [74]:
# BWT Alpine
url = 'https://en.wikipedia.org/wiki/Alpine_F1_Team'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

alpine_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  alpine_drivers.append(df_t[0])

alpine_drivers[1] = alpine_drivers[1][['Year', 'Drivers']]
alpine_drivers[1] = alpine_drivers[1].dropna()
alpine_drivers[1].drop([6], axis =0, inplace =True)

alpine_drivers[1]['Constructor'] = 'Alpine F1 Team'
alpine_drivers[1]['constructorId'] = '214'
alpine_drivers[1] = alpine_drivers[1].rename(columns={"Drivers": "Driver"})

alpine_drivers[1]
Out[74]:
Year Driver Constructor constructorId
1 2021 Fernando Alonso Alpine F1 Team 214
2 2021 Esteban Ocon Alpine F1 Team 214
4 2022 Fernando Alonso Alpine F1 Team 214
5 2022 Esteban Ocon Alpine F1 Team 214
In [75]:
# Alpha Tauri 
url = 'https://en.wikipedia.org/wiki/Scuderia_AlphaTauri'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

alpha_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  alpha_drivers.append(df_t[0])

alpha_drivers[1] = alpha_drivers[1][['Year', 'Drivers']]
alpha_drivers[1] = alpha_drivers[1].dropna()
alpha_drivers[1].drop([9], axis =0, inplace =True)

alpha_drivers[1]['Constructor'] = 'AlphaTauri'
alpha_drivers[1]['constructorId'] = '213'
alpha_drivers[1] = alpha_drivers[1].rename(columns={"Drivers": "Driver"})

alpha_drivers[1]
Out[75]:
Year Driver Constructor constructorId
1 2020 Pierre Gasly AlphaTauri 213
2 2020 Daniil Kvyat AlphaTauri 213
4 2021 Pierre Gasly AlphaTauri 213
5 2021 Yuki Tsunoda AlphaTauri 213
7 2022 Pierre Gasly AlphaTauri 213
8 2022 Yuki Tsunoda AlphaTauri 213
In [76]:
# Aston Martin 
# Please note that Aston Martin also raced in 1959 - 1960, but I will not be including them in this analysis 
url = 'https://en.wikipedia.org/wiki/Aston_Martin_in_Formula_One'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

aston_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  aston_drivers.append(df_t[0])

aston_drivers[3] = aston_drivers[3][['Year', 'Driver']]
aston_drivers[3] = aston_drivers[3].dropna()
aston_drivers[3].drop([7], axis =0, inplace =True)

aston_drivers[3]['Constructor'] = 'Aston Martin'
aston_drivers[3]['constructorId'] = '117'

aston_drivers[3]
Out[76]:
Year Driver Constructor constructorId
1 2021 Lance Stroll Aston Martin 117
2 2021 Sebastian Vettel Aston Martin 117
4 2022 Lance Stroll Aston Martin 117
5 2022 Sebastian Vettel Aston Martin 117
6 2022 Nico Hülkenberg Aston Martin 117
In [77]:
# Alfa Romeo 
# Please note Alfa Romeo has raced in 1950-1951, 1979-1985, and 2019-2022 but for this project I will only include the third set of dates
url = 'https://en.wikipedia.org/wiki/Alfa_Romeo_in_Formula_One'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

alfa_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  alfa_drivers.append(df_t[0])

alfa_drivers[2] = alfa_drivers[2][['Year', 'Drivers']]
alfa_drivers[2] = alfa_drivers[2].dropna()
alfa_drivers[2].drop([2], axis =0, inplace =True)
alfa_drivers[2].drop([10], axis =0, inplace =True)
alfa_drivers[2].drop([15], axis =0, inplace =True)
alfa_drivers[2].drop(alfa_drivers[2].loc[0:9].index, axis =0, inplace =True)

alfa_drivers[2] = alfa_drivers[2].join(alfa_drivers[2]['Drivers'].str.split( expand =True).rename(
    columns = {0: 'First1', 1: 'Last1', 2:'First2', 3:'Last2', 4:'First3', 5:'Last3'}))

alfa_drivers[2]['Driver1'] = alfa_drivers[2]['First1']  + ' ' + alfa_drivers[2]['Last1'] 
alfa_drivers[2]['Driver2'] = alfa_drivers[2]['First2']  + ' ' + alfa_drivers[2]['Last2'] 
alfa_drivers[2]['Driver3'] = alfa_drivers[2]['First3']  + ' ' + alfa_drivers[2]['Last3'] 

alfa_drivers[2] = alfa_drivers[2][['Year', 'Driver1', 'Driver2', 'Driver3']]

alfa_drivers[2] = pd.melt(alfa_drivers[2], id_vars=['Year'])
alfa_drivers[2] = alfa_drivers[2].dropna()

alfa_drivers[2]['Constructor'] = 'Alfa Romeo'
alfa_drivers[2]= alfa_drivers[2].drop(columns= ['variable'])
alfa_drivers[2] = alfa_drivers[2].rename(columns={"value": "Driver"})
alfa_drivers[2]['constructorId'] = '51'



alfa_drivers[2]
Out[77]:
Year Driver Constructor constructorId
0 2019 Kimi Räikkönen Alfa Romeo 51
1 2020 Kimi Räikkönen Alfa Romeo 51
2 2021 Kimi Räikkönen Alfa Romeo 51
3 2022 Zhou Guanyu Alfa Romeo 51
4 2019 Antonio Giovinazzi Alfa Romeo 51
5 2020 Antonio Giovinazzi Alfa Romeo 51
6 2021 Robert Kubica Alfa Romeo 51
7 2022 Valtteri Bottas Alfa Romeo 51
10 2021 Antonio Giovinazzi Alfa Romeo 51
In [78]:
# Haas
url = 'https://en.wikipedia.org/wiki/Haas_F1_Team'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

haas_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  haas_drivers.append(df_t[0])

haas_drivers[1] = haas_drivers[1][['Year', 'Drivers']]
haas_drivers[1] = haas_drivers[1].dropna()
haas_drivers[1].drop([22], axis =0, inplace =True)

haas_drivers[1]['Constructor'] = 'Haas F1 Team'
haas_drivers[1]['constructorId'] = '210'
haas_drivers[1] = haas_drivers[1].rename(columns={"Drivers": "Driver"})


haas_drivers[1]
Out[78]:
Year Driver Constructor constructorId
1 2016 Romain Grosjean Haas F1 Team 210
2 2016 Esteban Gutiérrez Haas F1 Team 210
4 2017 Romain Grosjean Haas F1 Team 210
5 2017 Kevin Magnussen Haas F1 Team 210
7 2018 Romain Grosjean Haas F1 Team 210
8 2018 Kevin Magnussen Haas F1 Team 210
10 2019 Romain Grosjean Haas F1 Team 210
11 2019 Kevin Magnussen Haas F1 Team 210
13 2020 Romain Grosjean Haas F1 Team 210
14 2020 Pietro Fittipaldi Haas F1 Team 210
15 2020 Kevin Magnussen Haas F1 Team 210
17 2021 Nikita Mazepin[a] Haas F1 Team 210
18 2021 Mick Schumacher Haas F1 Team 210
20 2022 Kevin Magnussen Haas F1 Team 210
21 2022 Mick Schumacher Haas F1 Team 210
In [79]:
# Williams Racing 

url = 'https://en.wikipedia.org/wiki/Williams_Grand_Prix_results'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

williams_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  williams_drivers.append(df_t[0])

williams_drivers = pd.concat([williams_drivers[4], williams_drivers[5]], ignore_index=True)

williams_drivers = williams_drivers[['Year', 'Drivers']]
williams_drivers = williams_drivers.dropna()
williams_drivers.drop([31], axis =0, inplace =True)
williams_drivers.drop([43], axis =0, inplace =True)


# Adding constructor name 
williams_drivers['Constructor'] = 'Williams'
williams_drivers = williams_drivers.rename(columns={"Drivers": "Driver"})
williams_drivers['constructorId'] = '3'

williams_drivers
Out[79]:
Year Driver Constructor constructorId
1 2010 Rubens Barrichello Williams 3
2 2010 Nico Hülkenberg Williams 3
4 2011 Rubens Barrichello Williams 3
5 2011 Pastor Maldonado Williams 3
7 2012 Pastor Maldonado Williams 3
8 2012 Bruno Senna Williams 3
10 2013 Pastor Maldonado Williams 3
11 2013 Valtteri Bottas Williams 3
13 2014 Felipe Massa Williams 3
14 2014 Valtteri Bottas Williams 3
16 2015 Felipe Massa Williams 3
17 2015 Valtteri Bottas Williams 3
19 2016 Felipe Massa Williams 3
20 2016 Valtteri Bottas Williams 3
22 2017 Lance Stroll Williams 3
23 2017 Felipe Massa Williams 3
24 2017 Paul di Resta Williams 3
26 2018 Lance Stroll Williams 3
27 2018 Sergey Sirotkin Williams 3
29 2019 George Russell Williams 3
30 2019 Robert Kubica Williams 3
33 2020 Nicholas Latifi Williams 3
34 2020 George Russell Williams 3
35 2020 Jack Aitken Williams 3
37 2021 Nicholas Latifi Williams 3
38 2021 George Russell Williams 3
40 2022 Alexander Albon Williams 3
41 2022 Nicholas Latifi Williams 3
42 2022 Nyck de Vries Williams 3
In [80]:
# COMBINING EACH OF THE DATAFRAMES FROM THE WEB SCRAPING INTO A LARGE DATAFRAME
current_drivers = pd.concat([haas_drivers[1], alfa_drivers[2],aston_drivers[3], alpha_drivers[1], alpine_drivers[1]
, mercedes_drivers[5], red_bull_drivers[3], mclaren_drivers, williams_drivers, ferrari_drivers], ignore_index=True)
current_drivers
# THIS DATAFRAME SHOWS THAT THE DRIVERS, THEIR TEAMS, AND THE YEAR FOR THE TEN CURRENT F1 TEAMS
Out[80]:
Year Driver Constructor constructorId
0 2016 Romain Grosjean Haas F1 Team 210
1 2016 Esteban Gutiérrez Haas F1 Team 210
2 2017 Romain Grosjean Haas F1 Team 210
3 2017 Kevin Magnussen Haas F1 Team 210
4 2018 Romain Grosjean Haas F1 Team 210
... ... ... ... ...
185 2020 Sebastian Vettel Ferrari 6
186 2021 Charles Leclerc Ferrari 6
187 2021 Carlos Sainz Jr. Ferrari 6
188 2022 Charles Leclerc Ferrari 6
189 2022 Carlos Sainz Jr. Ferrari 6

190 rows × 4 columns

EXTRACTION, TRANSFORMATION & LOADING

Screen Shot 2022-12-13 at 4.27.36 PM.png

Cleaning the lap_times dataframe & displaying dtypes for incorporation into the regression model later in this project

In [81]:
# RENAMING COLUMNS TO BE UNDERSTANDABLE
lap_times = lap_times.rename(columns={"time": "lap_time", "milliseconds": "lap_in_milli", "position" : "position_after_lap"})
# WE WILL ONLY BE USING MILLISECONDS, DROPPING LAP TIME IN MINUTES AND SECONDS
lap_times = lap_times.drop(columns=['lap_time'])
lap_times
Out[81]:
raceId driverId lap position_after_lap lap_in_milli
0 841 20 1 1 98109
1 841 20 2 1 93006
2 841 20 3 1 92713
3 841 20 4 1 92803
4 841 20 5 1 92342
... ... ... ... ... ...
538116 1096 822 53 16 92998
538117 1096 822 54 16 92995
538118 1096 822 55 16 91236
538119 1096 822 56 15 90566
538120 1096 822 57 15 90743

538121 rows × 5 columns

In [82]:
# ALL OF THE VARIABLES IN LAP_TIMES SHOULD BE QUANTITATIVE. IDS, POSITIONS, AND TIMES SHOULD BE INTEGERS THAT WE CAN DO CALCULATIONS WITH
lap_times.dtypes
Out[82]:
raceId                int64
driverId              int64
lap                   int64
position_after_lap    int64
lap_in_milli          int64
dtype: object

Screen Shot 2022-12-13 at 4.31.41 PM.png

In [83]:
# RENAMING COLUMNS TO BE UNDERSTANDABLE
pit_stops = pit_stops.rename(columns={"time" : "time_when_stopped", "milliseconds": "stop_in_milli", "stop" : "stop_#"})
# WE WILL ONLY BE USING MILLISECONDS, DROPPING LAP TIME IN MINUTES AND SECONDS
pit_stops = pit_stops.drop(columns=['duration'])
pit_stops
Out[83]:
raceId driverId stop_# lap time_when_stopped stop_in_milli
0 841 153 1 1 17:05:23 26898
1 841 30 1 1 17:05:52 25021
2 841 17 1 11 17:20:48 23426
3 841 4 1 12 17:22:34 23251
4 841 13 1 13 17:24:10 23842
... ... ... ... ... ... ...
9629 1096 849 2 38 18:02:50 25174
9630 1096 840 2 40 18:04:44 21802
9631 1096 839 2 41 18:06:09 21734
9632 1096 846 2 42 18:07:36 21559
9633 1096 855 2 44 18:11:10 23159

9634 rows × 6 columns

In [84]:
# ALL OF THE VARIABLES IN PIT_STOPS SHOULD BE QUANTITATIVE. IDS, STOP NUMBERS, AND TIMES SHOULD BE INTEGERS THAT WE CAN DO CALCULATIONS WITH
pit_stops.dtypes
Out[84]:
raceId                int64
driverId              int64
stop_#                int64
lap                   int64
time_when_stopped    object
stop_in_milli         int64
dtype: object

Screen Shot 2022-12-13 at 9.53.47 PM.png

How do qualifying rounds work in Formula One?

There are three qualifying rounds for each race weekend in Formula One. These rounds take place typically over two days prior to the race on Sunday. In the first round, all twenty Formula One drivers participate. The driver's race one lap individually and attempt to recieve the fastest time. Those in the top fifteen will move on to the second round of qualifiers. Those with the bottom five times, will start race day in the last five positions on the grid. In the second round of qualifiers, drivers drive one lap, once again aiming for the faastest time. Those with the fastest ten times will move on to the third round of qualifiers. Those who had the bottom five race times in Q2 will be placed in spots 10-15 on the grid for Sunday's race. The final qualifying round determines the order of the drivers who will be in positions 1-10 on race day.

How should this information shape our data?

Since not all drivers will participate in all three rounds of qualifying, I am going to create dummy variables that will indicate (based on their success in qualifiers) whether a driver will race in the top ten, middle five, or bottom five.

In [85]:
# CLEANING THE QUALIFYING DATASET, DROPPING UNNEEDED VARIABLES
qualifying = qualifying.drop(columns = ["number"])
# REPLACING NAN WITH 0, THIS MEANS THAT THEY DID NOT RACE IN THAT ROUND, NOTE EVERYONE SHOULD HAVE A TIME FOR Q1 BUT NOT EVERYONE WILL MAKE IT TO Q3
qualifying = qualifying.replace('\\N','0')
# DUMMY TO INDICATE THAT THEY WERE PLACED IN TOP TEN STARTING POSITIONS (1-10)
qualifying['top_ten'] = np.where(qualifying['q3'] != '0' , 1, 0)
# DUMMY TO INDICATE THAT THEY WERE IN THE MIDDLE FIVE STARTING POSITIONS (11-15)
qualifying['middle'] = np.where((qualifying['q3'] == '0') & (qualifying['q2'] >'0'), 1, 0)
# DUMMY TO INDICARE THAT THEY WERE IN THE BOTTOM FIVE STARTING POSITIONS (16-20)
qualifying['bottom_five'] = np.where((qualifying['q3']=='0') & (qualifying['q2']=='0'), 1, 0)
# RENAMING THE COLUMN SO WE KNOW WHERE THEY STARTED
qualifying = qualifying.rename(columns= {"position" : "starting_position"})
# MERGING IN INFOMATION ABOUT WHAT RACE
qualifying = qualifying.merge(races, how='left', on="raceId")
qualifying = qualifying[['qualifyId', 'raceId', 'driverId','constructorId', 'starting_position', 'top_ten', 'middle', 'bottom_five', 'circuitId', 'year', 'name']]
qualifying
Out[85]:
qualifyId raceId driverId constructorId starting_position top_ten middle bottom_five circuitId year name
0 1 18 1 1 1 1 0 0 1 2008 Australian Grand Prix
1 2 18 9 2 2 1 0 0 1 2008 Australian Grand Prix
2 3 18 5 1 3 1 0 0 1 2008 Australian Grand Prix
3 4 18 13 6 4 1 0 0 1 2008 Australian Grand Prix
4 5 18 2 2 5 1 0 0 1 2008 Australian Grand Prix
... ... ... ... ... ... ... ... ... ... ... ...
9570 9628 1096 825 210 16 0 0 1 24 2022 Abu Dhabi Grand Prix
9571 9629 1096 842 213 17 0 0 1 24 2022 Abu Dhabi Grand Prix
9572 9630 1096 822 51 18 0 0 1 24 2022 Abu Dhabi Grand Prix
9573 9631 1096 848 3 19 0 0 1 24 2022 Abu Dhabi Grand Prix
9574 9632 1096 849 3 20 0 0 1 24 2022 Abu Dhabi Grand Prix

9575 rows × 11 columns

In [86]:
qualifying.dtypes
# ALL OF THE VARIABLES IN THE QUALIFYING DATAFRAME SHOULD BE INTEGERS AS THEY EITHER REPRESENT IDS, POSITIONS, OR DUMMY VARIABLES. 
# THE ONLY EXCEPTION TO THIS RULE IS THE NAME OF THE CIRCUIT
Out[86]:
qualifyId             int64
raceId                int64
driverId              int64
constructorId         int64
starting_position     int64
top_ten               int64
middle                int64
bottom_five           int64
circuitId             int64
year                  int64
name                 object
dtype: object

Screen Shot 2022-12-13 at 10.23.51 PM.png

In [87]:
# DROPING COLUMNS I DON'T NEED FROM THE ORIGINAL RESULTS COLUMN
results2 = results.drop(columns=['resultId', 'number', 'positionText','positionOrder', 'time'])
# RENAMING THE COLUMNS FOR UNDERSTANDABILITY
results2 = results2.rename(columns= {"grid" : "starting_position", "position":"finishing_position", "rank" : "overall_standing", "milliseconds":"time_milliseconds"})
# FILLING IN LAP TIMES THAT HAD NO DATA (INDICATING THAT THE DRIVER DID NOT RACE OR DID NOT FINISH THE RACE, WITH ZEROS)
results2 = results2.replace('\\N','0')
# CHANGING THE DTYPES TO BE QUANTITATIVE IF A NUMBER AND A INTERGER IF IT REPRESENTS A CATEGORY 
results2['time_milliseconds'] = results2['time_milliseconds'].astype(int)
results2['finishing_position'] = results2['finishing_position'].astype(int)
results2['overall_standing'] = results2['overall_standing'].astype(int)
results2['statusId'] = results2['statusId'].astype(str)
results2['fastestLap'] = results2['fastestLap'].astype(int)
results2['fastestLapSpeed'] = results2['fastestLapSpeed'].astype(float)
# ADDING A DUMMY TO INDICATE THAT THE DRIVER FINISHED THE RACE
results2['finished'] = np.where(results2['statusId']=='1', 1,0)
# ADDING DUMMIES THAT INDICATE WHERE THE DRIVER PLACED
results2['podium'] = np.where(results2['finishing_position']<=3, 1,0)
results2['top_ten'] = np.where((results2['finishing_position']>3) & (results2['finishing_position']<=10), 1,0)
results2['bottom_ten'] = np.where(results2['finishing_position']>10, 1,0)
results2 = results2.drop(columns=['fastestLapTime'])
results2
Out[87]:
raceId driverId constructorId starting_position finishing_position points laps time_milliseconds fastestLap overall_standing fastestLapSpeed statusId finished podium top_ten bottom_ten
0 18 1 1 1 1 10.0 58 5690616 39 2 218.300 1 1 1 0 0
1 18 2 2 5 2 8.0 58 5696094 41 3 217.586 1 1 1 0 0
2 18 3 3 7 3 6.0 58 5698779 41 5 216.719 1 1 1 0 0
3 18 4 4 11 4 5.0 58 5707797 58 7 215.464 1 1 0 1 0
4 18 5 1 3 5 4.0 58 5708630 43 1 218.385 1 1 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25835 1096 854 210 12 16 0.0 57 0 39 12 211.632 11 0 0 0 1
25836 1096 825 210 16 17 0.0 57 0 40 20 208.556 11 0 0 0 1
25837 1096 1 131 5 18 0.0 55 0 42 11 211.738 6 0 0 0 1
25838 1096 849 3 20 19 0.0 55 0 45 14 210.517 130 0 0 0 1
25839 1096 4 214 10 0 0.0 27 0 24 17 209.889 47 0 1 0 0

25840 rows × 16 columns

In [88]:
results2.dtypes
# ALL OF THE VARIABLES IN THE REFINED RESULTS DATAFRAME SHOULD BE NUMERICAL AS THEY REPRESENT TIMES, IDS, AND DUMMY VARIABLES 
# WE ARE KEEPING STATUSID AS AN OBJECT BECUASE THE NUMBERS REFER TO SPECIFIC CATEGORIES REGARDING HOW A DRIVER FINISHED OR WHAT MAY HAVE CAUSED THEM TO NOT COMPLETE A RACE
Out[88]:
raceId                  int64
driverId                int64
constructorId           int64
starting_position       int64
finishing_position      int64
points                float64
laps                    int64
time_milliseconds       int64
fastestLap              int64
overall_standing        int64
fastestLapSpeed       float64
statusId               object
finished                int64
podium                  int64
top_ten                 int64
bottom_ten              int64
dtype: object

Screen Shot 2022-12-13 at 10.30.06 PM.png

  1. Concat web scrapes & limit drivers: The first step towards my regression model is limiting the data to only include information on drivers who have raced for the 10 current Formula one teams. As mentioned in the data section, I scraped the Wikipedia pages of each team for this information. Now, I am going to create a list of unique drivers that will act as a foreign key to any dataframe I use going for
In [89]:
# LET'S CREATE A LIST OF THE DRIVERS WHO HAVE DRIVEN FOR THE 10 CURRENT TEAMS
names_of_drivers = current_drivers['Driver'].unique()
names_of_drivers = pd.DataFrame(data = names_of_drivers, columns = ['Driver'])
# THERE WERE A HANDFUL OF SPELLING MISTAKES IN DRIVER NAME BETWEEN THE SCRAPES AND KAGGLE
# I HAVE MANUALLY ADJUSTED THESE SO THAT MERGING WOULD BE DONE EASILY
names_of_drivers = names_of_drivers.replace("Nikita Mazepin[c]", "Nikita Mazepin")
names_of_drivers = names_of_drivers.replace("Carlos Sainz Jr.", "Carlos Sainz")
names_of_drivers = names_of_drivers.replace("Zhou Guanyu", "Guanyu Zhou")
names_of_drivers = names_of_drivers.replace("Alex Albon", "Alexander Albon")
# NOW LETS MERGE IN THE DRIVER IDS 
# FIRST I HAVE TO CREATE A DATAFRAME THE CONTAINS THE NAME OF THE DRIVERS AND THEIR FULL NAMES USING THE KAGGLE DATASETS
# THEN I WILL USE THIS DATAFRAME AND MERGE IT WITH OUR UNIQUE NAMES TO CREATE A FOREIGN KEY 
drivers_names  = pd.DataFrame(data=drivers[['driverId', 'forename', 'surname']])
drivers_names['Driver'] = drivers_names['forename'] + ' ' + drivers_names['surname']
drivers_names = drivers_names[['driverId', 'Driver']]

names_of_drivers = names_of_drivers.merge(drivers_names, how='left')
names_of_drivers =names_of_drivers.dropna()
names_of_drivers.head()
# THERE ARE 46 UNIQUE DRIVERS WHO HAVE DRIVEN OR ARE CURRENTLY DRIVING FOR THE 10 CURRENT F1 TEAMS
Out[89]:
Driver driverId
0 Romain Grosjean 154.0
1 Esteban Gutiérrez 821.0
2 Kevin Magnussen 825.0
3 Pietro Fittipaldi 850.0
5 Mick Schumacher 854.0

2. Ranking Circuits Now that we have created a way to limit our data only to the current drivers. Lets building a ranking system for the circuits. I will build a ranking system using latitude, longitude, altitude, and average lap time. The following assumptions were made when I build my ranking model:

  • The further a circuit's latitude was from the average across all circuits, the circuit should be ranked higher in difficulty
  • The further a circuit's longitude was from the average across all circuits, the circuit should be ranked higher in difficulty.
  • The further a circuit's altitude was from the median across all circuits, the circuit should be ranked higher in difficulty.
    • I used median as opposed to average when dealing with altitude because the range was not limited like when we dealt with longitude and latitude and thus outliers skewed the average.
  • The higher a circuits average lap time was across all drivers within the given years, the circuit should be ranked higher in difficulty.
  • A higher difficulty will have a lower rank multiplier because I want to reward faster times on harder tracks
In [90]:
# LETS CREATE A RANKING SYSTEM FOR EACH OF THE CIRCUITS USED SINCE 2010
#LIMITING THE RACES FROM 2010 ONWARD AS THIS IS WHAT OUR DRIVERS WILL BE LIMITED TO
races_lim = races.loc[races['year'] >=2010]
# LIMITING THE CIRCUITS DATAFRAME BASED ON MATCHES IN THIS LIMITED SET 
circuits_lim = circuits.merge(races_lim, how='inner', on = ['circuitId'])
# LIMITING THE CIRCUITS DATA FRAME TO ONLY INCLUDE THE INFORMATION WE NEED
circuits_lim= circuits_lim[['circuitId', 'circuitRef', 'lat', 'lng', 'alt']]
#TAKING THE ABS VALUE OF THE LAT AND LONG SO I CAN PULL IT INTO THE RANKING SYSTEM
# ABS VALUES ALSO ENSURES THAT BEING ABOVE OR BELOW THE EQUATOR / TO THE LEFT OR RIGHT OF THE PRIME MERIDIAN DOES NOT CANCEL OUT ANY AVERAGES
circuits_lim['lat'] = circuits_lim['lat'].abs()
circuits_lim['lng'] = circuits_lim['lng'].abs()
# DELETING DUPLICATES 
circuits_lim = circuits_lim.drop_duplicates(subset=['circuitId'])
# FILLING THE THREE /N VALUES WITH THE AVERAGE ALT FOR THAT CITY 
circuits_lim = circuits_lim.replace('\\N','0')
circuits_lim['alt'] = circuits_lim['alt'].astype(int)
# CREATING VARIABLES THAT WILL SHOW THE DISTANCE FROM THE AVERAGE FOR EACH GEOGRAPHICAL METRIC
circuits_lim['long_diff'] = circuits_lim['lng'].apply(lambda lng: lng - circuits_lim['lng'].mean()).abs()
circuits_lim['lat_diff'] = circuits_lim['lat'].apply(lambda lat: lat - circuits_lim['lat'].mean()).abs()
circuits_lim['alt_diff'] = circuits_lim['alt'].apply(lambda alt: alt - circuits_lim['alt'].median()).abs()

circuits_lim
Out[90]:
circuitId circuitRef lat lng alt long_diff lat_diff alt_diff
0 1 albert_park 37.84970 144.968000 10 96.667412 1.767908 31.0
11 2 sepang 2.76083 101.738000 18 53.437412 33.320962 23.0
19 3 bahrain 26.03250 50.510600 7 2.210012 10.049292 34.0
32 4 catalunya 41.57000 2.261110 109 46.039478 5.488208 68.0
45 5 istanbul 40.95170 29.405000 130 18.895588 4.869908 89.0
49 6 monaco 43.73470 7.420560 7 40.880028 7.652908 34.0
61 7 villeneuve 45.50000 73.522800 13 25.222212 9.418208 28.0
72 9 silverstone 52.07860 1.016940 153 47.283648 15.996808 112.0
86 10 hockenheimring 49.32780 8.565830 103 39.734758 13.246008 62.0
92 11 hungaroring 47.57890 19.248600 264 29.051988 11.497108 223.0
105 12 valencia 39.45890 0.331667 4 47.968921 3.377108 37.0
108 13 spa 50.43720 5.971390 401 42.329198 14.355408 360.0
121 14 monza 45.61560 9.281110 162 39.019478 9.533808 121.0
134 15 marina_bay 1.29140 103.864000 18 55.563412 34.790392 23.0
145 17 shanghai 31.33890 121.220000 5 72.919412 4.742892 36.0
155 18 interlagos 23.70360 46.699700 785 1.600888 12.378192 744.0
167 20 nurburgring 50.33560 6.947500 578 41.353088 14.253808 537.0
170 21 imola 44.34390 11.716700 37 36.583888 8.262108 4.0
173 22 suzuka 34.84310 136.541000 45 88.240412 1.238692 4.0
184 24 yas_marina 24.46720 54.603100 3 6.302512 11.614592 38.0
197 32 rodriguez 19.40420 99.090700 2227 50.790112 16.677592 2186.0
204 34 ricard 43.25060 5.791670 432 42.508918 7.168808 391.0
208 35 yeongam 34.73330 126.417000 0 78.116412 1.348492 41.0
212 39 zandvoort 52.38880 4.540920 6 43.759668 16.307008 35.0
214 68 buddh 28.34870 77.533100 194 29.232512 7.733092 153.0
217 69 americas 30.13280 97.641100 161 49.340512 5.948992 120.0
227 70 red_bull_ring 47.21970 14.764700 678 33.535888 11.137908 637.0
238 71 sochi 43.40570 39.957800 2 8.342788 7.323908 39.0
246 73 baku 40.37250 49.853300 -7 1.552712 4.290708 48.0
252 75 portimao 37.22700 8.626700 108 39.673888 1.145208 67.0
254 76 mugello 43.99750 11.371900 255 36.928688 7.915708 214.0
255 77 jeddah 21.63190 39.104400 15 9.196188 14.449892 26.0
257 78 losail 25.49000 51.454200 0 3.153612 10.591792 41.0
258 79 miami 25.95810 80.238900 0 31.938312 10.123692 41.0
In [91]:
# NOW LETS LIMIT THE RESULTS DATAFRAME TO ONLY INCLUDE INFORMATION ON THE CURRENT DRIVERS AND CIRCUITS FROM 2010 ONWARD 
# ONCE WE HAVE THIS INFORMATION, WE CAN THEN CREATE A VARIABLE THAT AVERAGES THE LAP TIME ACROSS ALL DRIVERS AND WE CAN THEN ADD IT TO OUR RANKING MODEL

# Downselecting the results dataframe for only relevant variables
results_recent = results2[['driverId', 'raceId', 'starting_position', 'constructorId', 'finishing_position','fastestLapSpeed']]
# merge in circuit name based on raceID
results_recent = results_recent.merge(races, how='right')
results_recent = results_recent[['driverId', 'raceId', 'circuitId', 'name', 'year', 'constructorId','starting_position', 'finishing_position','fastestLapSpeed' ]]
results_recent = results_recent.dropna()
# I only want information for the list of 41 recent drivers 
results_recent = results_recent.merge(names_of_drivers, how='right')
# I now want information on their lap times in milliseconds
results_recent = results_recent.merge(lap_times, how='left')
# Adding constructor Name 
results_recent= results_recent.merge(constructors[['constructorId', 'constructorRef']], how = 'left', on = 'constructorId')
# limiting to 2010 onward
results_recent = results_recent.loc[results_recent['year'] >=2010]
# ADDING PITSTOP INFORMATION 
results_recent = results_recent.merge(pit_stops, how='left', on = ['driverId', 'lap', 'raceId'])
results_recent = results_recent.drop(columns=['position_after_lap', 'stop_#', 'time_when_stopped'])

# TO BUILD OUR RANKING MODEL, LETS ALSO FIND THE AVERAGE LAP TIME BY CIRCUIT FOR ALL CIRCUITS USED SINCE 2010
avg_lap_overall = results_recent.groupby(['circuitId'], as_index=False)[['lap_in_milli']].mean()
avg_lap_overall = avg_lap_overall.rename(columns= {"lap_in_milli" : "avg_lap_circuit"})
results_recent= results_recent.merge(avg_lap_overall[[ 'circuitId', 'avg_lap_circuit']], how = 'left', on = ['circuitId'] )
results_recent
Out[91]:
driverId raceId circuitId name year constructorId starting_position finishing_position fastestLapSpeed Driver lap lap_in_milli constructorRef stop_in_milli avg_lap_circuit
0 154 860 1 Australian Grand Prix 2012 208 3 0 0.000 Romain Grosjean 1.0 103730.0 lotus_f1 NaN 96150.180500
1 154 861 2 Malaysian Grand Prix 2012 208 6 0 155.333 Romain Grosjean 1.0 139773.0 lotus_f1 NaN 115520.417074
2 154 861 2 Malaysian Grand Prix 2012 208 6 0 155.333 Romain Grosjean 2.0 128464.0 lotus_f1 NaN 115520.417074
3 154 861 2 Malaysian Grand Prix 2012 208 6 0 155.333 Romain Grosjean 3.0 132909.0 lotus_f1 NaN 115520.417074
4 154 862 17 Chinese Grand Prix 2012 208 10 6 194.062 Romain Grosjean 1.0 112513.0 lotus_f1 NaN 105673.398333
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
241997 844 1096 24 Abu Dhabi Grand Prix 2022 6 3 2 211.901 Charles Leclerc 54.0 90095.0 ferrari NaN 105560.104535
241998 844 1096 24 Abu Dhabi Grand Prix 2022 6 3 2 211.901 Charles Leclerc 55.0 90151.0 ferrari NaN 105560.104535
241999 844 1096 24 Abu Dhabi Grand Prix 2022 6 3 2 211.901 Charles Leclerc 56.0 90033.0 ferrari NaN 105560.104535
242000 844 1096 24 Abu Dhabi Grand Prix 2022 6 3 2 211.901 Charles Leclerc 57.0 89829.0 ferrari NaN 105560.104535
242001 844 1096 24 Abu Dhabi Grand Prix 2022 6 3 2 211.901 Charles Leclerc 58.0 90500.0 ferrari NaN 105560.104535

242002 rows × 15 columns

In [92]:
# LETS MERGE THE AVERAGE LAP SPEED BACK INTO THE CIRCUITS LIMITED DATAFRAME THAT WE CREATED
circuits_lim = circuits_lim.merge(results_recent[['circuitId', 'avg_lap_circuit']], how = 'inner', on = ['circuitId'] )
circuits_lim = circuits_lim.drop_duplicates(subset=['circuitId'])
# NOW THAT WE HAVE ALL THE PERTINENT INFORMATION, LETS BUILD A RANKING SYSTEM 
# BECAUSE WE HAVE DATA WITH DIFFERENT RANGES, I DECIDED TO RANK EACH COLUMN INDIVUALLY AND THEN RANK THE SUM
circuits_lim["lat_rank"] = circuits_lim[["lat_diff"]].rank(method='average',ascending=True).astype(int)
circuits_lim["lng_rank"] = circuits_lim[["long_diff"]].rank(method='average',ascending=True).astype(int)
circuits_lim["alt_rank"] = circuits_lim[["alt_diff"]].rank(method='average',ascending=True).astype(int)
circuits_lim["lap_rank"] = circuits_lim[["avg_lap_circuit"]].rank(method='average',ascending=True).astype(int)
column_names = ['lat_rank', 'lng_rank', 'alt_rank', 'lap_rank']
circuits_lim['rank_sums']= circuits_lim[column_names].sum(axis=1)
# CREATING THE OVERALL RANK
circuits_lim["rank"] = circuits_lim['rank_sums'].rank(ascending = False)
# NOW I WILL CREATE A RANK FROM 1.001 - 1.034
# THIS WAY I CAN MULTIPLY DRIVER STATS GOING FORWARD IN ACCORDANCE TO HOW HARD EACH LAP WAS WITHOUT DRASTICALLY AFFECTING THE STATS
circuits_lim['rank_mult'] = circuits_lim['rank'].apply(lambda rank: (rank/1000)+1)
circuit_rank = circuits_lim[['circuitId', 'circuitRef', 'rank', 'rank_mult']]
circuit_rank
Out[92]:
circuitId circuitRef rank rank_mult
0 1 albert_park 25.0 1.0250
8770 2 sepang 6.0 1.0060
14911 3 bahrain 30.0 1.0300
27399 4 catalunya 21.0 1.0210
40800 5 istanbul 28.0 1.0280
44477 6 monaco 31.0 1.0310
58126 7 villeneuve 34.0 1.0340
69817 9 silverstone 2.0 1.0020
81123 10 hockenheimring 16.5 1.0165
87311 11 hungaroring 14.0 1.0140
101636 12 valencia 15.0 1.0150
103919 13 spa 1.0 1.0010
111773 14 monza 12.0 1.0120
122590 15 marina_bay 4.0 1.0040
132462 17 shanghai 13.0 1.0130
140982 18 interlagos 19.5 1.0195
154276 20 nurburgring 5.0 1.0050
156958 21 imola 32.0 1.0320
160407 22 suzuka 24.0 1.0240
168810 24 yas_marina 22.5 1.0225
180305 32 rodriguez 3.0 1.0030
188826 34 ricard 9.5 1.0095
192732 35 yeongam 9.5 1.0095
195714 39 zandvoort 19.5 1.0195
198497 68 buddh 22.5 1.0225
201097 69 americas 8.0 1.0080
209904 70 red_bull_ring 16.5 1.0165
222842 71 sochi 27.0 1.0270
229697 73 baku 26.0 1.0260
234769 75 portimao 33.0 1.0330
237302 76 mugello 7.0 1.0070
238141 77 jeddah 11.0 1.0110
239833 78 losail 29.0 1.0290
240890 79 miami 18.0 1.0180

3. Yearly Averages for Drivers I will now modify the results dataframe heavily to only contain year to year averages. This was done by merging multiple different kaggle datasets, performing aggregations on specific columns, and deleting unneccesary variables. After duplicates were removed, we were left with year-to-year averages for each of our recents drivers since 2010

In [93]:
results_recent= results_recent[['Driver', 'driverId', 'year', 'constructorRef', 'constructorId','raceId', 'circuitId', 'name', 'lap', 'lap_in_milli', 'stop_in_milli', 'fastestLapSpeed', 'starting_position', 'finishing_position']]
# MERGING IN THE RANK MULTIPLIER FOR EACH CIRCUIT
results_recent = results_recent.merge(circuit_rank[['rank_mult', 'circuitId']], how = "inner", on = ['circuitId'])
# REPLACING THE LAP IN MILLI, STOP IN MILLI AND FASTEST LAP SPEED WITH AN INTERACTION BETWEEN THE ORIGINAL VARIABLE AND THE CIRCUIT MULTIPLIER
results_recent['lap_in_milli'] = results_recent['lap_in_milli'] * results_recent['rank_mult']
results_recent['stop_in_milli'] = results_recent['stop_in_milli'] * results_recent['rank_mult']
results_recent['fastestLapSpeed'] = results_recent['fastestLapSpeed'] * results_recent['rank_mult']
# DOING THE SAME THING TO THE STARTING AND FINISHING POSITION
results_recent['starting_position'] = results_recent['starting_position'] * results_recent['rank_mult']
results_recent['finishing_position'] = results_recent['finishing_position'] * results_recent['rank_mult']
results_recent
Out[93]:
Driver driverId year constructorRef constructorId raceId circuitId name lap lap_in_milli stop_in_milli fastestLapSpeed starting_position finishing_position rank_mult
0 Romain Grosjean 154 2012 lotus_f1 208 860 1 Australian Grand Prix 1.0 106323.250 NaN 0.000000 3.075 0.000 1.025
1 Romain Grosjean 154 2013 lotus_f1 208 880 1 Australian Grand Prix 1.0 110063.475 NaN 216.472825 8.200 10.250 1.025
2 Romain Grosjean 154 2013 lotus_f1 208 880 1 Australian Grand Prix 2.0 98367.200 NaN 216.472825 8.200 10.250 1.025
3 Romain Grosjean 154 2013 lotus_f1 208 880 1 Australian Grand Prix 3.0 97798.325 NaN 216.472825 8.200 10.250 1.025
4 Romain Grosjean 154 2013 lotus_f1 208 880 1 Australian Grand Prix 4.0 98516.850 NaN 216.472825 8.200 10.250 1.025
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
241997 Charles Leclerc 844 2021 ferrari 6 1051 78 Qatar Grand Prix 53.0 90218.604 NaN 230.254185 13.377 8.232 1.029
241998 Charles Leclerc 844 2021 ferrari 6 1051 78 Qatar Grand Prix 54.0 102602.619 NaN 230.254185 13.377 8.232 1.029
241999 Charles Leclerc 844 2021 ferrari 6 1051 78 Qatar Grand Prix 55.0 111782.328 NaN 230.254185 13.377 8.232 1.029
242000 Charles Leclerc 844 2021 ferrari 6 1051 78 Qatar Grand Prix 56.0 93401.301 NaN 230.254185 13.377 8.232 1.029
242001 Charles Leclerc 844 2021 ferrari 6 1051 78 Qatar Grand Prix 57.0 90461.448 NaN 230.254185 13.377 8.232 1.029

242002 rows × 15 columns

In [94]:
#NOW THAT EACH OF THE STATS ARE ADEQUATELY MULTIPLIED BY CIRCUIT DIFF, WE CAN CREATE YEAR AVERAGES
#ADDING A COLUMN THAT REPRESENTS THE AVERAGE LAP SPEED FOR A GIVEN YEAR 
avg_lap = results_recent.groupby(['year', 'Driver'], as_index=False)[['lap_in_milli']].mean()
avg_lap = avg_lap.rename(columns= {"lap_in_milli" : "avg_lap"})
# MERGING BACK INTO RESULTS RECENT 
results_recent= results_recent.merge(avg_lap[['year', 'Driver', 'avg_lap']], how = 'left', on = ['year', 'Driver'] )
# ADDING A COLUMN THAT REPRESENT THE AVERAGE PIT STOP FOR A GIVEN YEAR
avg_pit = results_recent.groupby(['year', 'Driver'], as_index=False)[['stop_in_milli']].mean()
avg_pit = avg_pit.rename(columns= {"stop_in_milli" : "avg_pit"})
# MERGING BACK INTO RESULTS
results_recent= results_recent.merge(avg_pit[['year', 'Driver', 'avg_pit']], how = 'left', on = ['year', 'Driver'] )
# ADDING IN INFORMATION ON AVG FASTEST LAP SPEED 
avg_fastest_speed = results_recent.groupby(['year', 'Driver'], as_index=False)[['fastestLapSpeed']].mean()
avg_fastest_speed  = avg_fastest_speed.rename(columns= {"fastestLapSpeed" : "avg_fastest_speed"})
results_recent= results_recent.merge(avg_fastest_speed[['year', 'Driver', 'avg_fastest_speed']], how = 'left', on = ['year', 'Driver'] )
# DROPPING THE INDIV LAP SPEEDS AND PITS
results_recent = results_recent.drop(columns=['lap_in_milli', 'stop_in_milli', 'fastestLapSpeed'])
# ADDING IN INFORMATION ON AVERAGE STARTING POSITION 
avg_start = results_recent.groupby(['year', 'Driver'], as_index=False)[['starting_position']].mean()
avg_start = avg_start.rename(columns= {"starting_position" : "avg_start"})
results_recent= results_recent.merge(avg_start[['year', 'Driver', 'avg_start']], how = 'left', on = ['year', 'Driver'] )
# ADDING IN INFORMATION ON AVERAGE FINISHING POSITION
avg_finish = results_recent.groupby(['year', 'Driver'], as_index=False)[['finishing_position']].mean()
avg_finish = avg_finish.rename(columns= {"finishing_position" : "avg_finish"})
results_recent= results_recent.merge(avg_finish[['year', 'Driver', 'avg_finish']], how = 'left', on = ['year', 'Driver'] )
# DROPPING INFORMATION ABOUT STARTING AND FINISHING POSTIOIN FOR EACH RACE 
results_recent = results_recent.drop(columns=['starting_position', 'finishing_position'])
In [95]:
# WORKING TOWARDS THE NUMBER OF WINS PER SEASON 
driver_standings=driver_standings.merge(results_recent[['driverId', 'Driver', 'raceId']], on = ['raceId', 'driverId'] )
results_recent= results_recent.merge(driver_standings[['raceId', 'driverId', 'wins']], how = 'left', on = ['raceId', 'driverId'] )
max_wins =results_recent.groupby(['year', 'Driver'], as_index=False)[['wins']].max()
max_wins  = max_wins.rename(columns= {"wins" : "max_wins"})
results_recent= results_recent.merge(max_wins[['year', 'Driver', 'max_wins']], how = 'left', on = ['year', 'Driver'] )
results_recent = results_recent.drop(columns=['wins'])
results_recent = results_recent.drop(columns=['raceId', 'circuitId', 'name', 'lap'])
# I AM DELETING DUPLICATES BECAUSE I WANT YEAR AVERAGES NOT RACE BY RACE DATA
results_recent= results_recent.drop_duplicates(keep='first')
results_recent
Out[95]:
Driver driverId year constructorRef constructorId rank_mult avg_lap avg_pit avg_fastest_speed avg_start avg_finish max_wins
0 Romain Grosjean 154 2012 lotus_f1 208 1.025 100289.646279 24566.292948 199.093396 7.945919 6.712767 0.0
1 Romain Grosjean 154 2013 lotus_f1 208 1.025 102968.582047 24809.711859 202.512365 8.805052 6.164932 0.0
3365 Romain Grosjean 154 2014 lotus_f1 208 1.025 106758.387631 52423.719854 195.934564 15.319044 10.179991 0.0
5214 Romain Grosjean 154 2015 lotus_f1 208 1.025 99093.832835 25791.357500 191.579452 10.950473 8.363379 0.0
5215 Romain Grosjean 154 2016 haas 210 1.025 100591.930909 77837.086062 202.004696 14.690141 11.126294 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
14778367 Max Verstappen 830 2021 red_bull 9 1.029 94453.505408 144400.000709 216.435296 3.009178 1.832710 10.0
14781616 Lando Norris 846 2021 mclaren 1 1.029 98501.247917 224618.414474 216.060448 6.730722 6.376767 0.0
14784752 Carlos Sainz 832 2021 ferrari 6 1.029 98854.289747 274060.924514 214.915375 7.891423 6.346140 0.0
14788001 Nicholas Latifi 849 2021 williams 3 1.029 98983.331069 232462.748959 210.738190 15.748164 14.351341 0.0
14790501 Charles Leclerc 844 2021 ferrari 6 1.029 99627.140983 234220.483919 219.213507 6.755458 6.536672 0.0

4215 rows × 12 columns

In [96]:
# ADDING INFORMATION ON CONSTRUCTOR STADNINGS DURING A GIVEN YEAR TO EACH DRIVER
# WEIGHTING THE CONSTRUCTORS POINTS BY CIRCUIT DIFFICULTY
race_merge = circuit_rank.merge(races[['raceId', 'circuitId', 'year']], on = ['circuitId'])
constructor_rank = race_merge.merge(constructor_standings[['raceId', 'constructorId', 'points']], on = ['raceId'])
constructor_rank['points'] = constructor_rank['points'] * constructor_rank['rank_mult']
points_szn = constructor_rank.groupby(['constructorId','year'], as_index=False)[['points']].max()
points_szn  = points_szn.rename(columns= {"points" : "points_szn"})
constructor_rank= constructor_rank.merge(points_szn[['year', 'constructorId', 'points_szn']], on = ['year', 'constructorId'])
# MERGING THE POINTS THAT A CONSTRUCTOR HAD IN ONE SEASON BACK INTO RESULTS RECENT
results_recent= results_recent.merge(constructor_rank[['year', 'constructorId', 'points_szn']], on = ['year', 'constructorId'])
results_recent= results_recent.drop(columns = ['rank_mult'])
results_recent= results_recent.drop_duplicates()
results_recent
Out[96]:
Driver driverId year constructorRef constructorId avg_lap avg_pit avg_fastest_speed avg_start avg_finish max_wins points_szn
0 Romain Grosjean 154 2012 lotus_f1 208 100289.646279 24566.292948 199.093396 7.945919 6.712767 0.0 308.90850
20 Kimi Räikkönen 8 2012 lotus_f1 208 102985.901875 23068.233066 200.537352 7.714533 5.652291 1.0 308.90850
740 Romain Grosjean 154 2013 lotus_f1 208 102968.582047 24809.711859 202.512365 8.805052 6.164932 0.0 321.14250
759 Kimi Räikkönen 8 2013 lotus_f1 208 101190.998704 23729.164971 201.454099 7.523404 4.703663 1.0 321.14250
1386 Romain Grosjean 154 2014 lotus_f1 208 106758.387631 52423.719854 195.934564 15.319044 10.179991 0.0 10.22500
... ... ... ... ... ... ... ... ... ... ... ... ...
83251 Lando Norris 846 2020 mclaren 1 97375.025580 184058.145074 223.010074 9.194036 7.726659 0.0 206.54500
83268 Carlos Sainz 832 2020 mclaren 1 92796.389611 121690.841964 220.809292 8.650389 6.717341 0.0 206.54500
83727 Carlos Sainz 832 2021 ferrari 6 98854.289747 274060.924514 214.915375 7.891423 6.346140 0.0 330.77875
83749 Charles Leclerc 844 2021 ferrari 6 99627.140983 234220.483919 219.213507 6.755458 6.536672 0.0 330.77875
84607 Bruno Senna 811 2011 renault 4 104557.103788 23702.469609 204.941454 11.891747 14.222956 0.0 74.42350

243 rows × 12 columns

For reference throughout the analysis, I have created a table that shows which team our recent drivers have raced for since 2010. If there is a NaN in this table, it represents that the driver was not driving this year

In [97]:
# I want to create a graph that will show which drivers have switched teams within this dataset
# Creating a pivot table containing information on year, driver, and the team they were racing for
results_pivot = results_recent.pivot_table(
    index="Driver", columns =['year'], 
    values = "constructorRef", aggfunc=np.max
)
results_pivot
Out[97]:
year 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
Driver
Alexander Albon NaN NaN NaN NaN NaN NaN NaN NaN NaN toro_rosso red_bull NaN williams
Antonio Giovinazzi NaN NaN NaN NaN NaN NaN NaN sauber NaN alfa alfa alfa NaN
Bruno Senna hrt renault williams NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Carlos Sainz NaN NaN NaN NaN NaN toro_rosso toro_rosso toro_rosso renault mclaren mclaren ferrari ferrari
Charles Leclerc NaN NaN NaN NaN NaN NaN NaN NaN sauber ferrari ferrari ferrari ferrari
Christian Klien hrt NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Daniel Ricciardo NaN hrt toro_rosso toro_rosso red_bull red_bull red_bull red_bull red_bull renault renault mclaren mclaren
Daniil Kvyat NaN NaN NaN NaN toro_rosso red_bull toro_rosso toro_rosso NaN toro_rosso alphatauri NaN NaN
Esteban Gutiérrez NaN NaN NaN sauber sauber NaN haas NaN NaN NaN NaN NaN NaN
Esteban Ocon NaN NaN NaN NaN NaN NaN manor force_india force_india NaN renault alpine alpine
Felipe Massa ferrari ferrari ferrari ferrari williams williams williams williams NaN NaN NaN NaN NaN
Fernando Alonso ferrari ferrari ferrari ferrari ferrari mclaren mclaren mclaren mclaren NaN NaN alpine alpine
George Russell NaN NaN NaN NaN NaN NaN NaN NaN NaN williams williams williams mercedes
Guanyu Zhou NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN alfa
Jack Aitken NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN williams NaN NaN
Jenson Button mclaren mclaren mclaren mclaren mclaren mclaren mclaren mclaren NaN NaN NaN NaN NaN
Kevin Magnussen NaN NaN NaN NaN mclaren mclaren renault haas haas haas haas NaN haas
Kimi Räikkönen NaN NaN lotus_f1 lotus_f1 ferrari ferrari ferrari ferrari ferrari alfa alfa alfa NaN
Lance Stroll NaN NaN NaN NaN NaN NaN NaN williams williams racing_point racing_point aston_martin aston_martin
Lando Norris NaN NaN NaN NaN NaN NaN NaN NaN NaN mclaren mclaren mclaren mclaren
Lewis Hamilton mclaren mclaren mclaren mercedes mercedes mercedes mercedes mercedes mercedes mercedes mercedes mercedes mercedes
Mark Webber red_bull red_bull red_bull red_bull NaN NaN NaN NaN NaN NaN NaN NaN NaN
Max Verstappen NaN NaN NaN NaN NaN toro_rosso toro_rosso red_bull red_bull red_bull red_bull red_bull red_bull
Michael Schumacher mercedes mercedes mercedes NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Mick Schumacher NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN haas haas
Nicholas Latifi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN williams williams williams
Nico Hülkenberg williams NaN force_india sauber force_india force_india force_india renault renault renault racing_point NaN aston_martin
Nico Rosberg mercedes mercedes mercedes mercedes mercedes mercedes mercedes NaN NaN NaN NaN NaN NaN
Nyck de Vries NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN williams
Pastor Maldonado NaN williams williams williams lotus_f1 lotus_f1 NaN NaN NaN NaN NaN NaN NaN
Paul di Resta NaN force_india force_india force_india NaN NaN NaN williams NaN NaN NaN NaN NaN
Pierre Gasly NaN NaN NaN NaN NaN NaN NaN toro_rosso toro_rosso toro_rosso alphatauri alphatauri alphatauri
Pietro Fittipaldi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN haas NaN NaN
Robert Kubica renault NaN NaN NaN NaN NaN NaN NaN NaN williams NaN alfa NaN
Romain Grosjean NaN NaN lotus_f1 lotus_f1 lotus_f1 lotus_f1 haas haas haas haas haas NaN NaN
Rubens Barrichello williams williams NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Sebastian Vettel red_bull red_bull red_bull red_bull red_bull ferrari ferrari ferrari ferrari ferrari ferrari aston_martin aston_martin
Sergey Sirotkin NaN NaN NaN NaN NaN NaN NaN NaN williams NaN NaN NaN NaN
Sergio Pérez NaN sauber sauber mclaren force_india force_india force_india force_india force_india racing_point racing_point red_bull red_bull
Stoffel Vandoorne NaN NaN NaN NaN NaN NaN mclaren mclaren mclaren NaN NaN NaN NaN
Valtteri Bottas NaN NaN NaN williams williams williams williams mercedes mercedes mercedes mercedes mercedes alfa
Vitantonio Liuzzi force_india hrt NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Yuki Tsunoda NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN alphatauri alphatauri
  1. Based on the table above, I picked four drivers and have decided to do a regression discontinuity analysis on five important success metrics: average lap time, average pitstop time, average fastest lap speed, average starting position, and average finishing position. These averages are in relation to the year as a whole. I choose the four drivers based on their noterity, the amount of data they had and whether they switched teams at least once in their career. In the end, I choose Max Verstappen, Lewis Hamilton, Sebastian Vettel, and Daniel Ricciardo

    First I will create a graph that shows how these four drivers switched teams over the year. This graph is purely for visuals.

In [99]:
results_four = results_recent.loc[(results_recent['Driver']=="Daniel Ricciardo") |  (results_recent['Driver']== "Lewis Hamilton" )| (results_recent['Driver']== "Max Verstappen") | (results_recent['Driver']== "Sebastian Vettel")]
In [100]:
# Now we are going to create a line plot that shows how these drivers moved over the years 
# To limit how crezy the graph is, I am limiting my regression discontinuity to four of the drivers. The drivers had more data and at least one team switch 
plt.figure(figsize=(40,20))
sns.lineplot(data=results_four, x="year", y="constructorRef", hue="Driver", style = "Driver", markers = True, estimator=None,linewidth=10).set(title='The Four Drivers and their team history since 2010')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), prop={'size': 25})
Out[100]:
<matplotlib.legend.Legend at 0x7fb192d62430>
  1. Prove Regression Discontinuity for Constructors: I will now analyze these four drivers and see how their averages were affected pre and post switches

Where did my idea for constructor v. driver come from?

Regression Discontinuity Explanation

Screen Shot 2022-12-15 at 11.37.34 AM.png

In 2015 Max Verstappen was racing for Torro Rosso, who was ranked 7th overall. In 2016, he switched teams and raced for RedBull, who was ranked 4th in 2015. We want to see if Max's average lap time, average pitstop time, fastest Lap speed and other season indicators were affected by switching to a 'better' team.

In [101]:
max_verstappen = results_recent.loc[(results_recent['Driver'] == 'Max Verstappen')]
max_verstappen
Out[101]:
Driver driverId year constructorRef constructorId avg_lap avg_pit avg_fastest_speed avg_start avg_finish max_wins points_szn
36296 Max Verstappen 830 2019 red_bull 9 92890.720861 24075.109969 211.746296 4.592766 3.472382 3.0 426.38250
39166 Max Verstappen 830 2016 red_bull 9 103857.876384 132095.957048 203.453661 5.170951 4.378560 1.0 478.53000
58325 Max Verstappen 830 2017 red_bull 9 93926.217449 24463.751200 209.506335 6.608790 4.046239 2.0 376.28000
59126 Max Verstappen 830 2018 red_bull 9 93873.307916 23895.447615 213.477761 6.849136 3.268996 2.0 428.42750
63185 Max Verstappen 830 2022 red_bull 9 99668.420097 122141.051468 212.057342 3.367652 2.745595 15.0 776.07750
64087 Max Verstappen 830 2015 toro_rosso 5 99393.162132 26279.162439 197.828022 11.127288 8.614760 0.0 68.50750
64809 Max Verstappen 830 2016 toro_rosso 5 103857.876384 132095.957048 203.453661 5.170951 4.378560 1.0 64.41750
81912 Max Verstappen 830 2020 red_bull 9 98294.251748 128399.091673 220.661747 3.163371 2.192080 2.0 326.17750
82393 Max Verstappen 830 2021 red_bull 9 94453.505408 144400.000709 216.435296 3.009178 1.832710 10.0 598.67375
In [102]:
plt.figure(figsize=(20,20))
ax = plt.subplot(5,1,1)
ax.axvline(x = 2016, color = 'green', linestyle='-.')
max_verstappen.plot.scatter(x="year", y="avg_lap", ax=ax)
plt.title("Max Verstappen Yearly Averages (centered at 2016)")

ax = plt.subplot(5,1,2, sharex=ax)
ax.axvline(x = 2016, color = 'green', linestyle='-.')
max_verstappen.plot.scatter(x="year", y="avg_pit", ax=ax)

ax = plt.subplot(5,1,3, sharex=ax)
ax.axvline(x = 2016, color = 'green', linestyle='-.')
max_verstappen.plot.scatter(x="year", y="avg_fastest_speed", ax=ax)

ax = plt.subplot(5,1,4, sharex=ax)
ax.axvline(x = 2016, color = 'green', linestyle='-.')
max_verstappen.plot.scatter(x="year", y="avg_start", ax=ax)

ax = plt.subplot(5,1,5, sharex=ax)
ax.axvline(x = 2016, color = 'green', linestyle='-.')
max_verstappen.plot.scatter(x="year", y="avg_finish", ax=ax);

Screen Shot 2022-12-15 at 7.33.44 PM.png

Screen Shot 2022-12-15 at 11.39.59 AM.png

In 2014 Sebastian Vettel was racing for Red Bull, who was ranked 2nd overall. In 2015, he switched teams and raced for Ferrari, who was ranked 4th in 2014. We want to see if Sebastian's average lap time, average pitstop time and fastest Lap speed was affected by switching to a 'lesser' team.

Additionally, in 2020 Sebastian was racing for Ferrari, who was ranked 6th overall. In 2021, he switched teams and raced for Aston Martin, who would be new to the grid in 2021. We want to see how this new team affected hisaverage lap time, average pitstop time and fastest Lap speed.

In [103]:
seb_vettel = results_recent.loc[(results_recent['Driver'] == 'Sebastian Vettel')]
seb_vettel
Out[103]:
Driver driverId year constructorRef constructorId avg_lap avg_pit avg_fastest_speed avg_start avg_finish max_wins points_szn
10591 Sebastian Vettel 20 2015 ferrari 6 98706.900601 24936.426311 201.048768 5.378575 3.100642 3.0 437.6300
11315 Sebastian Vettel 20 2016 ferrari 6 103831.150217 163706.859425 200.397202 6.209918 3.977668 0.0 406.9550
12154 Sebastian Vettel 20 2017 ferrari 6 94435.559606 61427.478186 212.397571 3.230781 2.511592 5.0 533.7450
12955 Sebastian Vettel 20 2018 ferrari 6 94082.281011 24736.222034 213.342204 2.940756 3.049937 5.0 583.8475
24691 Sebastian Vettel 20 2022 aston_martin 117 100644.724406 164838.975955 204.454166 13.152550 10.757061 0.0 56.2375
25593 Sebastian Vettel 20 2010 red_bull 9 102820.057337 NaN 204.610587 1.976634 3.043656 5.0 509.2050
26315 Sebastian Vettel 20 2011 red_bull 9 107680.068632 22614.213102 203.486539 1.290685 1.584833 11.0 662.6750
26999 Sebastian Vettel 20 2012 red_bull 9 102676.173026 22344.125045 201.710089 5.125028 4.263296 5.0 468.9700
27759 Sebastian Vettel 20 2013 red_bull 9 100254.564253 22572.253170 203.225790 2.067007 1.617363 13.0 607.6220
28443 Sebastian Vettel 20 2014 red_bull 9 106651.695575 53107.073610 200.532725 7.711745 4.898647 0.0 414.1125
29127 Sebastian Vettel 20 2019 ferrari 6 93177.196921 25359.023716 213.452095 4.529317 4.863097 1.0 515.3400
75596 Sebastian Vettel 20 2021 aston_martin 117 99667.004061 236639.551035 213.443439 11.564461 10.210282 0.0 79.2330
76454 Sebastian Vettel 20 2020 ferrari 6 98112.007694 150191.974726 218.887297 12.069066 10.399663 0.0 134.9300
In [104]:
plt.figure(figsize=(20,20))
ax = plt.subplot(5,1,1)
ax.axvline(x = 2015, color = 'green', linestyle='-.')
ax.axvline(x = 2021, color = 'red', linestyle='-.')
seb_vettel.plot.scatter(x="year", y="avg_lap", ax=ax)
plt.title("Sebastian Vettel's Yearly Averages (centered at 2015 & 2021)")

ax = plt.subplot(5,1,2, sharex=ax)
ax.axvline(x = 2015, color = 'green', linestyle='-.')
ax.axvline(x = 2021, color = 'red', linestyle='-.')
seb_vettel.plot.scatter(x="year", y="avg_pit", ax=ax)

ax = plt.subplot(5,1,3, sharex=ax)
ax.axvline(x = 2015, color = 'green', linestyle='-.')
ax.axvline(x = 2021, color = 'red', linestyle='-.')
seb_vettel.plot.scatter(x="year", y="avg_fastest_speed", ax=ax)

ax = plt.subplot(5,1,4, sharex=ax)
ax.axvline(x = 2015, color = 'green', linestyle='-.')
ax.axvline(x = 2021, color = 'red', linestyle='-.')
seb_vettel.plot.scatter(x="year", y="avg_start", ax=ax)

ax = plt.subplot(5,1,5, sharex=ax)
ax.axvline(x = 2015, color = 'green', linestyle='-.')
ax.axvline(x = 2021, color = 'red', linestyle='-.')
seb_vettel.plot.scatter(x="year", y="avg_finish", ax=ax);

Screen Shot 2022-12-15 at 7.41.34 PM.png

When Sebastian switched from RedBull (2) to Ferrari (4), his average lap time, average pitstop time decreased and his fastest lap speed remained unaffected. When Sebastian switched from Ferrari (6) to Aston Martin (unranked) his average lap time, average pitstop time, and average fastest lap speed decreased.

Screen Shot 2022-12-15 at 11.47.55 AM.png

In 2012 Lewis Hamilton was racing for McLaren, who was ranked 3rd overall. In 2013, he switched teams and raced for Mercedes, who was ranked 5th in 2012. We want to see if Lewis's average lap time, average pitstop time and fastest Lap speed was affected by switching to a 'lesser' team.

In [105]:
hamilton = results_recent.loc[(results_recent['Driver'] == 'Lewis Hamilton')]
hamilton
Out[105]:
Driver driverId year constructorRef constructorId avg_lap avg_pit avg_fastest_speed avg_start avg_finish max_wins points_szn
19729 Lewis Hamilton 1 2017 mercedes 131 96237.801009 60394.480514 211.464884 4.119498 2.933767 9.0 683.03000
20530 Lewis Hamilton 1 2018 mercedes 131 94157.645918 23372.101942 213.118778 2.863300 1.932872 11.0 669.73750
21370 Lewis Hamilton 1 2019 mercedes 131 93475.237804 25603.376485 214.407015 2.370085 2.485981 11.0 755.62750
51582 Lewis Hamilton 1 2013 mercedes 131 100970.961314 22743.570989 201.097396 3.496429 5.243343 1.0 367.02000
52266 Lewis Hamilton 1 2014 mercedes 131 103693.060117 54119.550333 199.621863 4.896628 1.412480 11.0 716.77250
52950 Lewis Hamilton 1 2015 mercedes 131 97832.248123 24035.000321 202.845222 1.518149 1.797954 10.0 718.81750
53674 Lewis Hamilton 1 2016 mercedes 131 103225.947602 153754.495151 203.726327 3.974037 1.953633 10.0 782.21250
54493 Lewis Hamilton 1 2010 mclaren 1 103801.934421 NaN 203.973802 5.266207 3.721850 3.0 464.21500
55215 Lewis Hamilton 1 2011 mclaren 1 101001.788928 23010.866357 200.650904 3.712686 3.608099 3.0 506.69150
55899 Lewis Hamilton 1 2012 mclaren 1 101203.446936 23057.715292 200.751359 4.531232 4.522296 4.0 385.37100
56659 Lewis Hamilton 1 2022 mercedes 131 99303.265761 146032.938013 210.471562 6.579889 5.343275 0.0 526.58750
73742 Lewis Hamilton 1 2020 mercedes 131 99452.644919 199180.796000 225.353523 1.940218 1.871872 11.0 585.89250
74240 Lewis Hamilton 1 2021 mercedes 131 98093.040521 218373.463074 217.538994 3.229708 2.728978 8.0 627.30375
In [106]:
plt.figure(figsize=(20,20))
ax = plt.subplot(5,1,1)
ax.axvline(x = 2013, color = 'green', linestyle='-.')
hamilton.plot.scatter(x="year", y="avg_lap", ax=ax)
plt.title("Lewis Hamilton Yearly Averages (centered at 2013)")

ax = plt.subplot(5,1,2, sharex=ax)
ax.axvline(x = 2013, color = 'green', linestyle='-.')
hamilton.plot.scatter(x="year", y="avg_pit", ax=ax)

ax = plt.subplot(5,1,3, sharex=ax)
ax.axvline(x = 2013, color = 'green', linestyle='-.')
hamilton.plot.scatter(x="year", y="avg_fastest_speed", ax=ax)

ax = plt.subplot(5,1,4, sharex=ax)
ax.axvline(x = 2013, color = 'green', linestyle='-.')
hamilton.plot.scatter(x="year", y="avg_start", ax=ax)

ax = plt.subplot(5,1,5, sharex=ax)
ax.axvline(x = 2013, color = 'green', linestyle='-.')
hamilton.plot.scatter(x="year", y="avg_finish", ax=ax);

Screen Shot 2022-12-15 at 7.49.30 PM.png

Screen Shot 2022-12-15 at 12.03.33 PM.png

Daniel Ricciardo has has spent his relatively short Formula One career split between five teams: HRT, Torro Rosso, RedBull, Renault, and McLaren. For simplicity, I will only analyze the largest jumps of his career: RedBull to Renault. After driving for RedBull for five years, Ricciardo switched in 2019 from a team ranked 3rd to Renauly who was ranked 5th. Although this may not seem like a large jump, the top three teams in Formula One consistently score the best and the remaining seven tend to fall to the wayside. It is uncommon to see a driver in their prime voluntarily switch to a lower team and thus I want to see how Ricciardo's average lap time, average pitstop, and fastest lap times were affected.

In [107]:
ricciardo = results_recent.loc[(results_recent['Driver'] == 'Daniel Ricciardo') & ((results_recent['year']>=2014) & (results_recent['year']<=2020))]
ricciardo
Out[107]:
Driver driverId year constructorRef constructorId avg_lap avg_pit avg_fastest_speed avg_start avg_finish max_wins points_szn
28462 Daniel Ricciardo 817 2014 red_bull 9 104509.574582 55182.571225 187.834560 6.167052 3.317521 3.0 414.1125
35036 Daniel Ricciardo 817 2019 renault 4 93956.783509 26116.132420 209.045605 10.780369 7.813078 0.0 93.0475
38400 Daniel Ricciardo 817 2015 red_bull 9 98829.884468 25453.449895 198.220153 8.346150 7.955068 0.0 191.2075
39124 Daniel Ricciardo 817 2016 red_bull 9 103180.943178 126186.733173 203.268686 3.921764 4.426687 1.0 478.5300
58305 Daniel Ricciardo 817 2017 red_bull 9 97994.963205 64947.033344 211.871326 7.172349 3.242691 1.0 376.2800
59105 Daniel Ricciardo 817 2018 red_bull 9 94059.440750 24493.402875 211.325758 7.972252 3.072731 2.0 428.4275
79183 Daniel Ricciardo 817 2020 renault 4 98314.729630 211034.771552 222.018786 7.771777 6.770106 0.0 185.0725
In [108]:
plt.figure(figsize=(20,20))
ax = plt.subplot(5,1,1)
ax.axvline(x = 2019, color = 'green', linestyle='-.')
ricciardo.plot.scatter(x="year", y="avg_lap", ax=ax)
plt.title("Daniel Ricciardo's Yearly Averages (centered at 2019)")

ax = plt.subplot(5,1,2, sharex=ax)
ax.axvline(x = 2019, color = 'green', linestyle='-.')
ricciardo.plot.scatter(x="year", y="avg_pit", ax=ax)

ax = plt.subplot(5,1,3, sharex=ax)
ax.axvline(x = 2019, color = 'green', linestyle='-.')
ricciardo.plot.scatter(x="year", y="avg_fastest_speed", ax=ax)

ax = plt.subplot(5,1,4, sharex=ax)
ax.axvline(x = 2019, color = 'green', linestyle='-.')
ricciardo.plot.scatter(x="year", y="avg_start", ax=ax)

ax = plt.subplot(5,1,5, sharex=ax)
ax.axvline(x = 2019, color = 'green', linestyle='-.')
ricciardo.plot.scatter(x="year", y="avg_finish", ax=ax);

Screen Shot 2022-12-15 at 7.54.15 PM.png

Regression Discontinuity Analysis Conclusions

For the most part, the drivers and their switches followed our expectations. When Max Verstappen switched to a 'better' team, his success indicators were immediately affected positively and he continued to have larger positive affect each year. When Daniel Ricciardo switched to a 'worse; team, his success indicators were negatvily affected in the ensuring season.

There were certain anomolies in Lewis Hamilton's switch from McLaren to Mercedes. Although Mercedes was a lower ranking team in the previous season, Lewis's switch had positive affects on his success indicators.

Additionally, Sebestian Vettel's first switch from RedBull to Ferrari did not follow expectations, but his second switch from Ferrari to Aston Martin did follow expectations.

What we can see from this is that constructors do play a role in success, however Formula One is a highly volitile sport and teams can change drastically year to year, espicially ones with a strong historical footprint like Mercedes or Ferrari.

Going forward, I will still incorporate constructor success into my model but I will not rely on it solely.

Screen Shot 2022-12-15 at 11.24.40 PM.png

In [110]:
results_recent.sort_values(by=['Driver', 'year'], inplace=True)
results_recent['prev_szn(avg_lap)'] = results_recent['avg_lap'].shift(1)
results_recent['prev_szn(avg_pit)'] = results_recent['avg_pit'].shift(1)
results_recent['prev_szn(avg_fastest_speed)'] = results_recent['avg_fastest_speed'].shift(1)
results_recent['prev_szn(avg_start)'] = results_recent['avg_start'].shift(1)
results_recent['prev_szn(avg_finish)'] = results_recent['avg_finish'].shift(1)
results_recent['prev_szn(constructor)'] = results_recent['constructorRef'].shift(1)
results_recent['prev_szn(points_szn)'] = results_recent['points_szn'].shift(1)

# WE NEED TO GET DUMMIES FOR EACH OF THE INSTRUCTORS 
results_recent.dropna(inplace=True)

results_recent
Out[110]:
Driver driverId year constructorRef constructorId avg_lap avg_pit avg_fastest_speed avg_start avg_finish max_wins points_szn prev_szn(avg_lap) prev_szn(avg_pit) prev_szn(avg_fastest_speed) prev_szn(avg_start) prev_szn(avg_finish) prev_szn(constructor) prev_szn(points_szn)
40684 Alexander Albon 848 2019 toro_rosso 5 94843.825959 25552.474290 211.029878 9.459557 8.679614 0.0 86.91250 94843.825959 25552.474290 211.029878 9.459557 8.679614 red_bull 426.38250
81895 Alexander Albon 848 2020 red_bull 9 97423.053286 179393.963157 223.676150 7.661982 7.711942 0.0 326.17750 94843.825959 25552.474290 211.029878 9.459557 8.679614 toro_rosso 86.91250
60869 Alexander Albon 848 2022 williams 3 97142.944424 55795.668276 206.673985 15.976145 12.215660 0.0 8.19200 97423.053286 179393.963157 223.676150 7.661982 7.711942 red_bull 326.17750
15538 Antonio Giovinazzi 841 2017 sauber 15 97390.015845 29565.767000 208.371720 16.494862 11.663793 0.0 5.13000 97142.944424 55795.668276 206.673985 15.976145 12.215660 williams 8.19200
13795 Antonio Giovinazzi 841 2019 alfa 51 95307.730136 26546.434273 208.695616 14.113988 13.626046 0.0 58.28250 97390.015845 29565.767000 208.371720 16.494862 11.663793 sauber 5.13000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
73725 Valtteri Bottas 822 2020 mercedes 131 96888.970915 179747.603157 225.861546 2.351822 4.302174 2.0 585.89250 93671.399409 24458.736309 214.528297 3.770358 2.594020 mercedes 755.62750
74218 Valtteri Bottas 822 2021 mercedes 131 97789.887325 182645.088333 219.127120 5.224945 4.198383 1.0 627.30375 96888.970915 179747.603157 225.861546 2.351822 4.302174 mercedes 585.89250
14636 Valtteri Bottas 822 2022 alfa 51 99739.226565 174877.833435 207.186386 10.372971 9.218129 0.0 56.23750 97789.887325 182645.088333 219.127120 5.224945 4.198383 mercedes 627.30375
77428 Yuki Tsunoda 852 2021 alphatauri 213 102057.050888 254802.358150 213.634943 11.868348 11.231716 0.0 145.19500 99739.226565 174877.833435 207.186386 10.372971 9.218129 alfa 56.23750
37137 Yuki Tsunoda 852 2022 alphatauri 213 101343.794141 124908.809967 207.580001 12.317971 11.091051 0.0 35.78750 102057.050888 254802.358150 213.634943 11.868348 11.231716 alphatauri 145.19500

217 rows × 19 columns

In [149]:
# PICKING THE FEATURES I WANT TO USE IN MY MODEL
features = ["constructorRef", "prev_szn(avg_lap)",
            "prev_szn(avg_pit)", 
            "prev_szn(avg_fastest_speed)", "prev_szn(avg_start)",
            "prev_szn(avg_finish)", "prev_szn(constructor)", "driverId", "points_szn"]
# CHANGING ANY CATEGORICAL VARIABLE INTO A DUMMY
X_dict = results_recent[features].to_dict(orient="records")
# SPECIFYING WHAT MY OUTCOME VARIABLE WILL BE 
y = results_recent["avg_finish"]

# SINCE WE WILL BE USING CROSS VALIDATION, MUST SPECIFY THE PIPELINE 
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
model = KNeighborsRegressor(n_neighbors=5)
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
In [150]:
# ESTABLISH THE FORMULA WE WILL USE FOR FIVE FOLD CROSS VALIDATION
def get_cv_error(k):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mse = np.mean(-cross_val_score(
        pipeline, X_dict, y, 
        cv=5, scoring="neg_mean_squared_error"
    ))
    return mse
    
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
test_errs = ks.apply(get_cv_error)

test_errs.plot.line()
test_errs.sort_values()
Out[150]:
8      7.182978
7      7.311592
10     7.384043
9      7.390593
12     7.420570
11     7.476093
6      7.477737
5      7.677406
4      7.678093
13     7.696386
14     7.821234
15     7.950701
16     7.960301
17     8.079358
19     8.082446
20     8.105191
18     8.125320
21     8.145914
22     8.229979
25     8.233796
23     8.254357
2      8.259490
24     8.262089
26     8.291529
3      8.305774
27     8.363079
28     8.509334
29     8.584867
32     8.621233
30     8.643802
31     8.666332
33     8.725684
34     8.790572
35     8.877698
36     8.931591
37     8.980388
38     9.048362
39     9.138213
40     9.208246
41     9.274713
42     9.309658
43     9.363100
44     9.429379
45     9.481925
46     9.492120
47     9.536455
48     9.609352
49     9.700202
50     9.823878
1     10.650024
dtype: float64

Initially, I thought it would be wise to optimize the KNN using cross fold validation because I wanted to minimize the error in the model, however, I realized with a small sample of drivers and construcotrs, the larger K got, the more similar all of the predicted finishing positions got. I've included this in the project, but wanted to first explain why I would not be using the predicted K going forward

For the training set, I used lagged success indicators for each of the drivers. I also limited the training set to not include 2022, as this would be what I tested my predictions on.

In [159]:
# ESTABLISHING X TRAIN, Y TRAIN, AND X TEST 
features = ["year", "driverId", "prev_szn(avg_lap)",
            "prev_szn(avg_pit)", 
            "prev_szn(avg_fastest_speed)", "prev_szn(avg_start)",
            "prev_szn(avg_finish)", "prev_szn(constructor)", "points_szn"]
X_train = pd.get_dummies(results_recent[features])
# I AM INCLUDING 2022 BUT ONLY USING LAGS IN MY MODEL SO THAT THE RACERS IN 2022 CAN STILL PULL THEIR LAG FROM 2021
X_train = X_train.loc[X_train['year']<2022]

y_train = results_recent[["avg_finish", "year"]]
y_train = y_train.loc[y_train['year']<2022]
y_train = y_train.drop(columns=['year'])

X_test = pd.get_dummies(results_recent[features])
X_test = X_test.loc[X_test['year'] ==2022]
In [160]:
from sklearn.neighbors import KNeighborsRegressor
KNN_model = KNeighborsRegressor(n_neighbors=2).fit(X_train,y_train)
X_test['predicted_avg_finish'] = KNN_model.predict(X_test) 
predictions = X_test[['driverId', 'prev_szn(avg_finish)', 'predicted_avg_finish']]
predictions
Out[160]:
driverId prev_szn(avg_finish) predicted_avg_finish
60869 848 7.711942 8.096563
67354 832 6.346140 9.250067
67376 844 6.536672 10.048628
59945 817 8.380792 9.250067
46855 839 9.039821 10.048628
46833 4 8.596266 11.161563
56681 847 11.725527 11.485813
14614 855 4.276638 6.694303
8964 825 12.379080 6.928385
24669 840 9.850742 9.051147
59967 846 6.376767 10.048628
56659 1 2.728978 7.127547
63185 830 1.832710 6.072795
8986 854 16.270225 5.554885
60891 849 14.351341 10.048628
24735 807 7.496708 13.265702
61463 856 2.460180 2.892485
37115 842 7.698309 10.048628
24691 20 10.210282 10.048628
63163 815 5.874301 8.096563
14636 822 4.198383 11.860135
37137 852 11.231716 11.161563
In [161]:
# MERGING IN DRIVER NAME AND ACTUAL FINISH FOR 2022
results_2022 = results_recent.loc[results_recent['year']==2022] 

predictions = predictions.merge(results_2022[['Driver', 'driverId', 'year', 'avg_finish']], how="inner", on = ['driverId'])
predictions
Out[161]:
driverId prev_szn(avg_finish) predicted_avg_finish Driver year avg_finish
0 848 7.711942 8.096563 Alexander Albon 2022 12.215660
1 832 6.346140 9.250067 Carlos Sainz 2022 3.396151
2 844 6.536672 10.048628 Charles Leclerc 2022 3.245347
3 817 8.380792 9.250067 Daniel Ricciardo 2022 11.079349
4 839 9.039821 10.048628 Esteban Ocon 2022 8.029349
5 4 8.596266 11.161563 Fernando Alonso 2022 7.968330
6 847 11.725527 11.485813 George Russell 2022 4.276638
7 855 4.276638 6.694303 Guanyu Zhou 2022 12.237815
8 825 12.379080 6.928385 Kevin Magnussen 2022 12.401714
9 840 9.850742 9.051147 Lance Stroll 2022 11.063592
10 846 6.376767 10.048628 Lando Norris 2022 7.317157
11 1 2.728978 7.127547 Lewis Hamilton 2022 5.343275
12 830 1.832710 6.072795 Max Verstappen 2022 2.745595
13 854 16.270225 5.554885 Mick Schumacher 2022 13.170827
14 849 14.351341 10.048628 Nicholas Latifi 2022 14.804669
15 807 7.496708 13.265702 Nico Hülkenberg 2022 14.996916
16 856 2.460180 2.892485 Nyck de Vries 2022 9.108000
17 842 7.698309 10.048628 Pierre Gasly 2022 10.713856
18 20 10.210282 10.048628 Sebastian Vettel 2022 10.757061
19 815 5.874301 8.096563 Sergio Pérez 2022 3.987783
20 822 4.198383 11.860135 Valtteri Bottas 2022 9.218129
21 852 11.231716 11.161563 Yuki Tsunoda 2022 11.091051

We know that George Russel switched from Williams to Mercedes in 2022. We will use his data from 2022 and see what our model predicts his average finishing position to be. Our model predicted that Russell would average a finishing position of 11.72 but he actually finished with average position of 4.27. This could be because our model fails to accurately weight the affect that switching to a better team has on a driver's success. Another switch was Vatteri Battas who transferred from Mercedes to Alfa. Our model predicted that he would have an average finishing position of 11.06 but he actually finished with an average finish of 9.05. This is a much closer gap and reflects the negative affect of switching to a worse team. This higher prediction could be caused by his high finishing position in the previous season.

Final Steps

Screen Shot 2022-12-15 at 11.37.14 PM.png

All in all, I have come to the conclusion that it is very difficult to finetune finishing position. My model was limited by the amount of public data available, the fact that constructors remodel their car year to year, and that when we use lags, we cannot correctly gauge the affect of a switch because our model is using metrics from the previous season (and thus the previous team) to predict how a driver will perform with a new team. These limitation led to large gaps between predicted average finish and average finish. Additionally, our model was also limited by the fact that there are only 20 racers within a given season and thus using a large KNN centers all of the predictions around 7-10 regardless of the driver and their personal ability.

In [ ]:
# Converting my CoLab to html so that I can upload this file to Github
%%shell
jupyter nbconvert --to html /content/drive/MyDrive/finaltutorial.ipynb
[NbConvertApp] Converting notebook /content/drive/MyDrive/finaltutorial.ipynb to html
[NbConvertApp] Writing 8018831 bytes to /content/drive/MyDrive/finaltutorial.html
Out[ ]: